Abstract: This paper considers a problem where multiple
users make repeated decisions based on their own observed 
events. The events and decisions at each time step determine 
the values of a utility function and a collection of penalty 
functions. The goal is to make distributed decisions over time to 
maximize time average utility subject to time average constraints 
on the penalties. An example is a collection of power constrained 
sensor nodes that repeatedly report their own observations to 
a fusion center. Maximum time average utility is fundamentally 
reduced because users do not know the events observed by others. 
Optimality is characterized for this distributed context. It is 
shown that optimality is achieved by correlating user decisions 
through a commonly known pseudorandom sequence. An optimal 
algorithm is developed that chooses pure strategies at each time 
step based on a set of time-varying weights. 



I. Introduction 

Consider a multi-user system that operates over discrete
time with unit time slots $t \in \{0, 1, 2, \ldots\}$. There are $N$ users.
At each time slot $t$, each user $i$ observes a random event $\omega_i(t)$
and makes a control action $\alpha_i(t)$ based on this observation.
Let $\boldsymbol{\omega}(t)$ and $\boldsymbol{\alpha}(t)$ be vectors of these values:

    $\boldsymbol{\omega}(t) = (\omega_1(t), \omega_2(t), \ldots, \omega_N(t))$

    $\boldsymbol{\alpha}(t) = (\alpha_1(t), \alpha_2(t), \ldots, \alpha_N(t))$

For each slot $t$, these vectors determine the values of a
system utility $u(t)$ and a collection of system penalties
$p_1(t), \ldots, p_K(t)$ (for some non-negative integer $K$) via
real-valued functions:

    $u(t) = u(\boldsymbol{\alpha}(t), \boldsymbol{\omega}(t))$
    $p_k(t) = p_k(\boldsymbol{\alpha}(t), \boldsymbol{\omega}(t)) \quad \forall k \in \{1, \ldots, K\}$

The functions $u(\cdot)$ and $p_k(\cdot)$ are arbitrary and can take
negative values. Negative penalties can be used to represent desirable
system rewards.

The goal is to make distributed decisions over time that
maximize time average utility subject to time average constraints
on the penalties. Central to this problem is the
assumption that each user $i$ can only observe $\omega_i(t)$, and
cannot observe the value of $\omega_j(t)$ for other users $j \neq i$.
Further, each user $i$ only knows its own action $\alpha_i(t)$, but
does not know the actions $\alpha_j(t)$ of others. Therefore, each
user only knows a portion of the arguments that go into the
functions $u(\boldsymbol{\alpha}(t), \boldsymbol{\omega}(t))$ and $p_k(\boldsymbol{\alpha}(t), \boldsymbol{\omega}(t))$ for each slot $t$.
This uncertainty fundamentally restricts the time averages that
can be achieved.

Specifically, assume the random event vector $\boldsymbol{\omega}(t)$ is
independent and identically distributed (i.i.d.) over slots (possibly
correlated over entries in each slot). The vector $\boldsymbol{\omega}(t)$ takes
values in some abstract event space $\Omega = \Omega_1 \times \Omega_2 \times \cdots \times \Omega_N$,
where $\omega_i(t) \in \Omega_i$ for all $i \in \{1, \ldots, N\}$ and all slots $t$.
Similarly, assume $\boldsymbol{\alpha}(t)$ is chosen in some abstract action
space $\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2 \times \cdots \times \mathcal{A}_N$, where $\alpha_i(t) \in \mathcal{A}_i$ for
all $i \in \{1, \ldots, N\}$ and all slots $t$. Let $\bar{u}$ and $\bar{p}_k$ be the time
average expected utility and penalty incurred by a particular
algorithm.¹ The following problem is considered:

    Maximize:    $\bar{u}$                                                  (1)
    Subject to:  $\bar{p}_k \leq c_k \quad \forall k \in \{1, \ldots, K\}$  (2)
                 Decisions are distributed                                  (3)

where $c_k$ are a given collection of real numbers that specify
constraints on the time average penalties.

The constraint that decisions must be distributed, specified
in (3), is not mathematically precise. This constraint is more
carefully posed in Section III. Without the distributed scheduling
constraint, the problem (1)-(3) reduces to a standard problem
of stochastic network optimization and can be solved via
the drift-plus-penalty method [1]. Such a centralized approach
would allow users to coordinate to form an action vector $\boldsymbol{\alpha}(t)$
based on full knowledge of the event vector $\boldsymbol{\omega}(t)$. The time
average utility achieved by the best centralized algorithm can
be strictly larger than that of the best distributed algorithm.
This is shown for an example sensor network problem in
Section II.

A. Applications to sensor networks 

The above formulation is useful for a variety of stochastic
network optimization problems where distributed agents make
their own decisions based on partial system knowledge. An
important example is a network of wireless sensor nodes that
repeatedly send reports about system events to a fusion center.

¹ For simplicity, it is temporarily assumed that the time averages exist. A
more precise formulation is specified in Section III using lim inf and lim sup.

The goal is to make distributed decisions that maximize time
average quality of information. This scenario was previously
considered by Liu et al. in [2]. There, sensors can provide
reports every slot $t$ using one of multiple reporting formats,
such as text, image, or video. Sensors can also choose to
remain idle on slot $t$. Thus, the action spaces $\mathcal{A}_i$ are the same
for all sensors $i$:

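    $\alpha_i(t) \in \mathcal{A}_i \triangleq \{\text{idle}, \text{text}, \text{image}, \text{video}\} \quad \forall i \in \{1, \ldots, N\}$

where the notation "$\triangleq$" represents "defined to be equal to." Each
format requires a different amount of power and provides a
different level of quality. For example, define $p_i(t)$ as the
power incurred by sensor $i$ on slot $t$, where:

    $p_i(t) = \begin{cases} 0 & \text{if } \alpha_i(t) = \text{idle} \\ p_{text} & \text{if } \alpha_i(t) = \text{text} \\ p_{image} & \text{if } \alpha_i(t) = \text{image} \\ p_{video} & \text{if } \alpha_i(t) = \text{video} \end{cases}$

where $p_{text}$, $p_{image}$, $p_{video}$ represent powers required for each
of the three reporting formats and satisfy:

    $0 < p_{text} \leq p_{image} \leq p_{video}$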

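Assume that $\omega_i(t)$ represents the quality that sensor $i$ would
bring to the fusion center if it reports the event it observes on
slot $t$ using the video format. Define $f(\alpha_i(t))$ as the fraction
of this quality that is achieved under format $\alpha_i(t)$:

    $f(\alpha_i(t)) = \begin{cases} 0 & \text{if } \alpha_i(t) = \text{idle} \\ f_{text} & \text{if } \alpha_i(t) = \text{text} \\ f_{image} & \text{if } \alpha_i(t) = \text{image} \\ 1 & \text{if } \alpha_i(t) = \text{video} \end{cases}$

where $0 < f_{text} < f_{image} < 1$.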

The prior work [2] considers the problem of maximizing
time average utility subject to a time average power constraint:

    $\sum_{i=1}^{N} \bar{p}_i \leq c$

where $c$ is some given positive number. Further, that work
restricts to the special case when the utility function is a
separable sum of functions of user $i$ variables, such as:

    $u(t) = \sum_{i=1}^{N} f(\alpha_i(t)) \omega_i(t)$

Such separable utilities cannot model the realistic scenario
of information saturation, where, once a certain amount of
utility is achieved on slot $t$, there is little value in having
additional sensors spend power to deliver additional information
on that slot. The current paper considers the case of arbitrary,
possibly non-separable utility functions. An example is:

    $u(t) = \min\left[ \sum_{i=1}^{N} f(\alpha_i(t)) \omega_i(t), \; 1 \right]$



This means that once a total quality of 1 is accumulated from
one or more sensors on slot $t$, there is no advantage in having
other sensors report information on that slot. This scenario is
significantly more challenging to solve in a distributed context.
For example, suppose the $\omega_i(t)$ variables are binary valued,
representing whether or not sensor $i$ observes an event on slot
$t$. Suppose $\omega_1(t) = \omega_2(t) = 1$. Utility is maximized if either
sensor 1 or sensor 2 decides to report in the video format.
Power is wasted if they both send video reports. However,
sensor 1 does not know the value of $\omega_2(t)$, sensor 2 does
not know the value of $\omega_1(t)$, and neither sensor knows what
format will be selected by the other.

B. Applications to wireless multiple access 

The general formulation of this paper can also treat simple
forms of distributed multiple access problems. Again suppose
there are $N$ wireless sensors that report to a fusion center.
For each $i \in \{1, \ldots, N\}$, define $\omega_i(t)$ as the quality that
a transmission from sensor $i$ would bring to the system if
it transmits on slot $t$. Define $\alpha_i(t)$ as a binary value that
is 1 if sensor $i$ transmits on slot $t$, and 0 else. Assume the
network operates according to a simple collision model, where
a transmission from sensor $i$ is successful on slot $t$ if and only
if it is the only sensor that transmits on that slot:

    $u(t) = \sum_{i=1}^{N} \omega_i(t) \alpha_i(t) \prod_{j \neq i} (1 - \alpha_j(t))$    (4)
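The collision rule can be checked numerically. The sketch below is illustrative only: the function names and the sample quality vector are hypothetical, and the closed form used in the second function (a sum over sensors of quality times own action times the product of the other sensors' silence indicators) is an assumption about how the verbal rule is written out. It compares that closed form with a direct "exactly one transmitter" test over all binary action vectors.

```python
from itertools import product

def collision_utility(omega, alpha):
    """Quality collected on a slot: omega[i] if sensor i is the only
    transmitter, and 0 if nobody transmits or if there is a collision."""
    transmitters = [i for i, a in enumerate(alpha) if a == 1]
    return omega[transmitters[0]] if len(transmitters) == 1 else 0

def collision_utility_product_form(omega, alpha):
    """Same quantity written as a sum of products over the other sensors."""
    total = 0
    for i, (w_i, a_i) in enumerate(zip(omega, alpha)):
        others_silent = 1
        for j, a_j in enumerate(alpha):
            if j != i:
                others_silent *= 1 - a_j
        total += w_i * a_i * others_silent
    return total

omega = (3, 5, 2)  # hypothetical per-sensor qualities on one slot
agree = all(
    collision_utility(omega, alpha) == collision_utility_product_form(omega, alpha)
    for alpha in product((0, 1), repeat=len(omega))
)
```

The utility of any one sensor's transmission depends on the actions of all the others, which is the non-separability discussed next.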



The above utility function is non-separable. Concurrent work
in [3] considers a similar utility function for wireless energy
harvesting applications.

C. Contributions and related work 

The framework of partial knowledge at each user is similar
in spirit to a multi-player Bayesian game [4][5]. There, the
goal is to design competitive strategies that lead to a Nash
equilibrium. This is significantly different from the goal of the
current paper. The current paper is not concerned with competition
or equilibrium. Rather, there is a single utility function
that all users desire to maximize. Distributed algorithms are
developed to maximize time average utility subject to time
average penalty constraints.

This paper shows that an optimal distributed algorithm can
be designed by having users correlate their decisions through
an independent source of common randomness (Section III).
Related notions of commonly shared randomness are used
in game theory to define a correlated equilibrium, which is
typically easier to compute than a standard Nash equilibrium
[6][7][5][4]. For the current paper, the shared randomness is
crucial for solving the distributed optimization problem. This
paper shows that optimality can be achieved by using a shared
random variable with $K + 1$ possible outcomes, where $K$ is
the number of penalty constraints. The solution is computable
through a linear program. Unfortunately, the linear program
can have a very large number of variables, even for 2-user
problems. A reduction to polynomial complexity is shown
to be possible in certain cases (Section IV). This paper also
develops an online algorithm that chooses pure strategies every
slot based on a set of weights that are updated at the end
of each slot (Section V). The online technique is based on
Lyapunov optimization concepts [1][8][9].

Much prior work on network optimization treats scenarios
where it is possible to find distributed solutions with no
loss of optimality. For example, network flow problems that
are described by linear or separable convex programs can
be optimally solved in a distributed manner [10][11][12][9].
Problems where network nodes want to average sensor data
[13] or compute convex programs [14] have distributed solutions.
Work in [15] solves for an optimal vector of parameters
associated with an infinite horizon Markov decision problem
using distributed agents. Work in [16][17][18] develops
distributed multiple access methods that converge to optimality.
However, the above problems do not have random events that
create a fundamental gap between centralized and distributed
performance.

Recent work in [19] derives structural results for distributed
optimization in Markov decision systems with delayed information.
Such problems do exhibit gaps between centralized
and distributed scheduling. The use of private information in
[19] is similar in spirit to the assumption in the current paper
that each user observes its own random event $\omega_i(t)$. The work
[19] derives a sufficient statistic for dynamic programming. It
does not consider time average constraints and its solutions
do not involve correlated scheduling via a pseudorandom
sequence. Recent work in [3] considers distributed reporting of
events with different qualities, but considers a more restrictive
class of policies that do not use correlated scheduling. The
current paper treats a different model than [19] and [3], and
shows that correlated scheduling is necessary in systems with
constraints. Further, the current paper provides complexity
reduction results under a preferred action property (Section
IV) and provides an online algorithm that does not require
a-priori knowledge of event probabilities (Section V).

II. Example sensor network problem 

This section illustrates the benefits of using a common
source of randomness for a simple example network. Suppose
the network has two sensors that operate over time slots
$t \in \{0, 1, 2, \ldots\}$. Every slot, the sensors observe the state
of a particular system and choose whether or not to report
their observations to a fusion center. Let $\omega_i(t)$ be a binary
variable that is 1 if sensor $i$ observes an event on slot $t$, and
0 else. Let $\alpha_1(t)$ and $\alpha_2(t)$ be the slot $t$ decision variables,
so that $\alpha_i(t) = 1$ if sensor $i$ reports on slot $t$, and $\alpha_i(t) = 0$
otherwise. Suppose the fusion center trusts sensor 1 more than
sensor 2. The utility $u(t)$ is:

    $u(t) = \min[\omega_1(t)\alpha_1(t) + \omega_2(t)\alpha_2(t)/2, \; 1]$

so that the deterministic function $u(\cdot)$ is given by:

    $u(\alpha_1, \alpha_2, \omega_1, \omega_2) = \min[\omega_1\alpha_1 + \omega_2\alpha_2/2, \; 1]$    (5)

Therefore, $u(t) \in \{0, 1/2, 1\}$ for all slots $t$. If $\omega_1(t) = 1$ and
sensor 1 reports on slot $t$, there is no utility increase if sensor
2 also reports.

Each report uses one unit of power. Let $p_i(t)$ be the
power incurred by sensor $i$ on slot $t$, being 1 if it reports
its observation, and 0 otherwise. The power penalties for
$i \in \{1, 2\}$ are:

    $p_i(t) = \alpha_i(t)$    (6)

so that $p_i(\alpha_1, \alpha_2, \omega_1, \omega_2) = \alpha_i$ for $i \in \{1, 2\}$. Each sensor $i$
can choose not to report an observation in order to save power.
The difficulty is that neither sensor knows what event was
observed by the other. Therefore, a distributed algorithm might
send reports from both sensors on a given slot. A centralized
scheduler would avoid this because it wastes power without
increasing utility.

Suppose that $\omega_1(t)$ and $\omega_2(t)$ are independent of each other
and i.i.d. over slots, with:

    $\Pr[\omega_1(t) = 1] = 3/4, \quad \Pr[\omega_1(t) = 0] = 1/4$
    $\Pr[\omega_2(t) = 1] = 1/2, \quad \Pr[\omega_2(t) = 0] = 1/2$



To fix a specific numerical example, consider the following 
problem: 

    Maximize:    $\bar{u}$                                        (7)
    Subject to:  $\bar{p}_1 \leq 1/3, \quad \bar{p}_2 \leq 1/3$   (8)
                 Decisions are distributed                        (9)

A. Independent reporting 

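Consider the following class of independent scheduling
algorithms: Each sensor $i$ independently decides to report with
probability $\theta_i$ if it observes $\omega_i(t) = 1$ (it does not report
if $\omega_i(t) = 0$). Since $\boldsymbol{\omega}(t)$ is i.i.d. over slots, the resulting
sequences $\{u(t)\}_{t=0}^{\infty}$, $\{p_1(t)\}_{t=0}^{\infty}$, $\{p_2(t)\}_{t=0}^{\infty}$ are i.i.d. over
slots. The time averages are:

    $\bar{p}_1 = \frac{3}{4}\theta_1, \quad \bar{p}_2 = \frac{1}{2}\theta_2$

    $\bar{u} = E[u(t) \mid \omega_1(t) = 1, \omega_2(t) = 0] \cdot \frac{3}{4} \cdot \frac{1}{2}$
        $+ \; E[u(t) \mid \omega_1(t) = 0, \omega_2(t) = 1] \cdot \frac{1}{4} \cdot \frac{1}{2}$
        $+ \; E[u(t) \mid \omega_1(t) = \omega_2(t) = 1] \cdot \frac{3}{4} \cdot \frac{1}{2}$
      $= \frac{3}{8}\theta_1 + \frac{1}{8}(\theta_2/2) + \frac{3}{8}(\theta_1 + (1 - \theta_1)\theta_2/2)$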

For this class of algorithms, utility is maximized by choosing
$\theta_1$ and $\theta_2$ to meet the power constraints with equality. This
leads to $\theta_1 = 4/9$, $\theta_2 = 2/3$. The resulting utility is:

    $\bar{u} = 4/9 \approx 0.44444$
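These values can be reproduced by exact enumeration. The following is a minimal sketch, assuming the event probabilities 3/4 and 1/2 and the utility function (5) of this example; the function names are illustrative and are not part of the paper.

```python
from fractions import Fraction as F

# Event probabilities of the example: Pr[w1 = 1] = 3/4, Pr[w2 = 1] = 1/2.
Q1, Q2 = F(3, 4), F(1, 2)

def utility(a1, a2, w1, w2):
    """Utility (5): min[w1*a1 + w2*a2/2, 1]."""
    return min(w1 * a1 + F(w2 * a2, 2), 1)

def independent_averages(theta1, theta2):
    """Exact (p1, p2, u) when sensor i reports with probability theta_i
    whenever it observes an event, independently of the other sensor."""
    p1 = Q1 * theta1
    p2 = Q2 * theta2
    u = F(0)
    for w1, pw1 in ((0, 1 - Q1), (1, Q1)):
        for w2, pw2 in ((0, 1 - Q2), (1, Q2)):
            # A sensor can only report (action 1) if it saw an event.
            for a1, pa1 in ((0, 1 - theta1 * w1), (1, theta1 * w1)):
                for a2, pa2 in ((0, 1 - theta2 * w2), (1, theta2 * w2)):
                    u += pw1 * pw2 * pa1 * pa2 * utility(a1, a2, w1, w2)
    return p1, p2, u

p1, p2, u = independent_averages(F(4, 9), F(2, 3))
print(p1, p2, u)  # 1/3 1/3 4/9
```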

B. Correlated reporting 

As an alternative, consider the following three strategies:

• Strategy 1: $\omega_1(t) = 1 \Rightarrow \alpha_1(t) = 1$ (else, $\alpha_1(t) = 0$).
  Sensor 2 always chooses $\alpha_2(t) = 0$.
• Strategy 2: $\omega_2(t) = 1 \Rightarrow \alpha_2(t) = 1$ (else, $\alpha_2(t) = 0$).
  Sensor 1 always chooses $\alpha_1(t) = 0$.
• Strategy 3: $\omega_1(t) = 1 \Rightarrow \alpha_1(t) = 1$ (else, $\alpha_1(t) = 0$).
  $\omega_2(t) = 1 \Rightarrow \alpha_2(t) = 1$ (else, $\alpha_2(t) = 0$).

The above three strategies are pure strategies because $\alpha_i(t)$
is a deterministic function of $\omega_i(t)$ for each sensor $i$. Now let
$X(t)$ be an external source of randomness that is commonly
known at both sensors on slot $t$. Assume $X(t)$ is independent
of everything else in the system, and is i.i.d. over slots with:

    $\Pr[X(t) = 1] = \theta_1$
    $\Pr[X(t) = 2] = \theta_2$
    $\Pr[X(t) = 3] = \theta_3$

where $\theta_1, \theta_2, \theta_3$ are probabilities that sum to 1. Consider the
following algorithm: On slot $t$, if $X(t) = m$ then choose
strategy $m$, where $m \in \{1, 2, 3\}$. This algorithm can be
implemented by letting $X(t)$ be a pseudorandom sequence
that is installed in both sensors at time 0. The resulting time
averages are:



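    $\bar{p}_1 = (\theta_1 + \theta_3)\frac{3}{4}, \quad \bar{p}_2 = (\theta_2 + \theta_3)\frac{1}{2}$

    $\bar{u} = \theta_1 \frac{3}{4} + \theta_2 \frac{1}{2} \cdot \frac{1}{2} + \theta_3 \left( \frac{3}{4} + \frac{1}{4} \cdot \frac{1}{2} \cdot \frac{1}{2} \right)$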


A simple linear program can be used to compute the optimal
$\theta_1, \theta_2, \theta_3$ probabilities for this algorithm structure. The result
is $\theta_1 = 1/3$, $\theta_2 = 5/9$, $\theta_3 = 1/9$. The resulting time average
utility is:

    $\bar{u} = 23/48 \approx 0.47917$

This is strictly larger than the time average utility of 0.44444 
achieved by the independent reporting algorithm. Thus, per- 
formance can be strictly improved by correlating reports via a 
common source of randomness. Alternatively, the same time 
averages can be achieved by time sharing: The two sensors 
agree to use a periodic schedule of period 9 slots. The first 3
slots of the period use strategy 1, the next 5 slots use strategy
2, and the final slot uses strategy 3.
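The averages for the correlated algorithm, and their agreement with the period-9 time-sharing schedule, can be verified exactly. This is a sketch under the assumptions of this example (event probabilities 3/4 and 1/2, utility (5), unit power per report); the helper names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

Q = (F(3, 4), F(1, 2))  # Pr[w_i = 1] for sensors 1 and 2

def utility(a1, a2, w1, w2):
    """Utility (5): min[w1*a1 + w2*a2/2, 1]."""
    return min(w1 * a1 + F(w2 * a2, 2), 1)

# Pure strategies 1-3: each maps observed events (w1, w2) to actions (a1, a2).
STRATEGIES = {
    1: lambda w1, w2: (w1, 0),
    2: lambda w1, w2: (0, w2),
    3: lambda w1, w2: (w1, w2),
}

def strategy_averages(m):
    """Expected (p1, p2, u) on a slot where pure strategy m is used."""
    p1 = p2 = u = F(0)
    for w1, w2 in product((0, 1), repeat=2):
        prob = (Q[0] if w1 else 1 - Q[0]) * (Q[1] if w2 else 1 - Q[1])
        a1, a2 = STRATEGIES[m](w1, w2)
        p1 += prob * a1
        p2 += prob * a2
        u += prob * utility(a1, a2, w1, w2)
    return p1, p2, u

def mixed_averages(theta):
    """Averages when strategy m is selected with probability theta[m]."""
    totals = [F(0)] * 3
    for m, prob in theta.items():
        for k, value in enumerate(strategy_averages(m)):
            totals[k] += prob * value
    return tuple(totals)

random_mix = mixed_averages({1: F(1, 3), 2: F(5, 9), 3: F(1, 9)})

# Time sharing: 3 slots of strategy 1, 5 of strategy 2, 1 of strategy 3.
schedule = [1] * 3 + [2] * 5 + [3]
time_share = tuple(
    sum(strategy_averages(m)[k] for m in schedule) / len(schedule)
    for k in range(3)
)
print(random_mix)
```

Both `random_mix` and `time_share` evaluate to power averages of 1/3 for each sensor and a utility of 23/48.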

C. Centralized reporting 

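Suppose sensors coordinate by observing $(\omega_1(t), \omega_2(t))$ and
then cooperatively selecting $(\alpha_1(t), \alpha_2(t))$. It turns out that
an optimal centralized policy is as follows [1]: Every slot $t$,
observe $(\omega_1(t), \omega_2(t))$ and choose $(\alpha_1(t), \alpha_2(t))$ as follows:

• $(\omega_1(t), \omega_2(t)) = (0, 0) \Rightarrow (\alpha_1(t), \alpha_2(t)) = (0, 0)$.

• $(\omega_1(t), \omega_2(t)) = (0, 1) \Rightarrow (\alpha_1(t), \alpha_2(t)) = (0, 1)$.

• If $(\omega_1(t), \omega_2(t)) = (1, 0)$, independently choose:

    $(\alpha_1(t), \alpha_2(t)) = (1, 0)$ with probability 8/9
    $(\alpha_1(t), \alpha_2(t)) = (0, 0)$ with probability 1/9

• If $(\omega_1(t), \omega_2(t)) = (1, 1)$, independently choose:

    $(\alpha_1(t), \alpha_2(t)) = (0, 1)$ with probability 5/9
    $(\alpha_1(t), \alpha_2(t)) = (0, 0)$ with probability 4/9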

The resulting optimal centralized time average utility is:

    $\bar{u} = 0.5$

This is larger than the value 0.47917 achieved by the dis- 
tributed algorithm of the previous subsection. 
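The centralized policy can be checked the same way. The sketch below assumes the joint event probabilities implied by the independent events of this example (1/8, 1/8, 3/8, 3/8) and encodes the policy as a table of action distributions; the names `PI` and `POLICY` are illustrative.

```python
from fractions import Fraction as F

# Joint event probabilities Pr[(w1, w2)] for independent w1, w2.
PI = {(0, 0): F(1, 8), (0, 1): F(1, 8), (1, 0): F(3, 8), (1, 1): F(3, 8)}

# Centralized policy: for each observed event pair, a distribution over
# action pairs (a1, a2).
POLICY = {
    (0, 0): {(0, 0): F(1)},
    (0, 1): {(0, 1): F(1)},
    (1, 0): {(1, 0): F(8, 9), (0, 0): F(1, 9)},
    (1, 1): {(0, 1): F(5, 9), (0, 0): F(4, 9)},
}

def utility(a1, a2, w1, w2):
    """Utility (5): min[w1*a1 + w2*a2/2, 1]."""
    return min(w1 * a1 + F(w2 * a2, 2), 1)

def centralized_averages():
    """Exact (p1, p2, u) under the centralized policy above."""
    p1 = p2 = u = F(0)
    for (w1, w2), pw in PI.items():
        for (a1, a2), pa in POLICY[(w1, w2)].items():
            p1 += pw * pa * a1
            p2 += pw * pa * a2
            u += pw * pa * utility(a1, a2, w1, w2)
    return p1, p2, u

print(centralized_averages())
```

Both power constraints are met with equality and the utility is 1/2, compared with 23/48 for the correlated distributed algorithm.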

The question remains: Is it possible to construct some other 
distributed algorithm that yields u > 0.47917? Results in the 
next section imply this is impossible. Thus, the correlated 
reporting algorithm of the previous subsection optimizes time 
average utility over all possible distributed algorithms that 
satisfy the constraints. Therefore, for this example, there 
is a fundamental gap between the performance of the best 
centralized algorithm and the best distributed algorithm. 






III. Characterizing optimality 

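This section considers the general $N$ user problem and
characterizes optimality over all possible distributed algorithms.
Recall that:

    $\boldsymbol{\omega}(t) \in \Omega = \Omega_1 \times \cdots \times \Omega_N$
    $\boldsymbol{\alpha}(t) \in \mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$

where the vectors $\boldsymbol{\omega}(t)$ are i.i.d. over slots (possibly correlated
over entries in each slot). Assume that the sets $\Omega_i$ and $\mathcal{A}_i$ are
finite with sizes denoted $|\Omega_i|$ and $|\mathcal{A}_i|$. For each $\boldsymbol{\omega} \in \Omega$ define:

    $\pi(\boldsymbol{\omega}) = \Pr[\boldsymbol{\omega}(t) = \boldsymbol{\omega}]$

Define the history $\mathcal{H}(t)$ by:

    $\mathcal{H}(t) \triangleq \{(\boldsymbol{\omega}(0), \boldsymbol{\alpha}(0)), \ldots, (\boldsymbol{\omega}(t-1), \boldsymbol{\alpha}(t-1))\}$

This section considers all distributed algorithms, including
those where all users know the full history $\mathcal{H}(t)$. Such
information might be available through a feedback message
that specifies $(\boldsymbol{\alpha}(t), \boldsymbol{\omega}(t))$ at the end of each slot $t$. Theorem
1 shows that optimality can be achieved without this history
information.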

First, it is important to make the distributed scheduling
constraint (3) mathematically precise. One might attempt to
use the following condition. For all slots $t$, the decisions made
by each user $i \in \{1, \ldots, N\}$ must satisfy:

    $\Pr[\alpha_i(t) = \alpha_i \mid \omega_i(t) = \omega_i, \mathcal{H}(t)]$
        $= \Pr[\alpha_i(t) = \alpha_i \mid \boldsymbol{\omega}(t) = \boldsymbol{\omega}, \mathcal{H}(t)]$    (10)

for all vectors $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_N) \in \Omega_1 \times \cdots \times \Omega_N$ and
all $\alpha_i \in \mathcal{A}_i$. The condition (10) specifies that $\alpha_i(t)$ is
conditionally independent of $(\omega_j(t))|_{j \neq i}$ given $\omega_i(t)$, $\mathcal{H}(t)$.
While this condition is indeed required, it turns out that it is
not restrictive enough. Appendix B provides an example utility
function for which there is an algorithm that satisfies (10) but
yields expected utility strictly larger than that of any "true"
distributed algorithm (as defined in the next subsection).

A. The distributed scheduling constraint 

An algorithm for selecting $\boldsymbol{\alpha}(t)$ over slots $t \in \{0, 1, 2, \ldots\}$
is distributed if:

• There is an abstract set $\mathcal{X}$, called a common information
  set.

• There is a sequence of commonly known random elements
  $X(t) \in \mathcal{X}$ such that $\boldsymbol{\omega}(t)$ is independent of $X(t)$ for each
  $t \in \{0, 1, 2, \ldots\}$.

• There are deterministic functions $f_i(\omega_i, X)$ for each $i \in
  \{1, \ldots, N\}$ of the form:

    $f_i : \Omega_i \times \mathcal{X} \to \mathcal{A}_i$

• The decisions $\alpha_i(t)$ satisfy the following for all slots $t$:

    $\alpha_i(t) = f_i(\omega_i(t), X(t))$ for all $i \in \{1, \ldots, N\}$    (11)

The above definition includes a wide class of algorithms.
Intuitively, the random elements $X(t)$ can be designed as any
source of common randomness on which users can base their
decisions. For example, $X(t)$ can be designed to have the
form:

    $X(t) = (t, \mathcal{H}(t), Y(t))$

where $Y(t)$ is a random element with support and distribution
that can possibly depend on $\mathcal{H}(t)$ as well as past values $Y(\tau)$
for $\tau < t$. The only restriction is that $X(t)$ is independent of
$\boldsymbol{\omega}(t)$. Because the $\boldsymbol{\omega}(t)$ vectors are i.i.d. over slots, $X(t)$ can
be based on any events that occur before slot $t$.
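As an illustration of this definition, the sketch below implements the two-sensor correlated algorithm of Section II in the form $\alpha_i(t) = f_i(\omega_i(t), X(t))$. It is only a sketch: Python's `random.Random` stands in for the commonly installed pseudorandom sequence, and the seed values and function names are hypothetical. Each sensor regenerates $X(t)$ locally from the shared seed and never reads the other sensor's event.

```python
import random

THETA = (1 / 3, 5 / 9, 1 / 9)  # Pr[X(t) = 1], Pr[X(t) = 2], Pr[X(t) = 3]

def common_sequence(seed, num_slots):
    """Pseudorandom X(0), ..., X(num_slots - 1). Every sensor that runs this
    with the same seed obtains the same sequence, with no communication."""
    rng = random.Random(seed)
    return rng.choices((1, 2, 3), weights=THETA, k=num_slots)

def f1(omega1, x):
    """Sensor 1: report an observed event only under strategies 1 and 3."""
    return omega1 if x in (1, 3) else 0

def f2(omega2, x):
    """Sensor 2: report an observed event only under strategies 2 and 3."""
    return omega2 if x in (2, 3) else 0

SEED = 2024  # hypothetical value installed in both sensors at time 0
NUM_SLOTS = 1000
x_at_sensor1 = common_sequence(SEED, NUM_SLOTS)
x_at_sensor2 = common_sequence(SEED, NUM_SLOTS)

# Events are drawn from a separate generator, so X(t) is independent of them.
events = random.Random(7)
reports = []
for t in range(NUM_SLOTS):
    w1 = int(events.random() < 3 / 4)
    w2 = int(events.random() < 1 / 2)
    reports.append((f1(w1, x_at_sensor1[t]), f2(w2, x_at_sensor2[t])))
```

The design point is that correlation between the two decisions comes only from $X(t)$; each $f_i$ takes the sensor's own event and the common element as its sole inputs.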



B. The optimization problem 

For notational convenience, define:

    $p_0(t) \triangleq -u(t)$
    $p_0(\boldsymbol{\alpha}(t), \boldsymbol{\omega}(t)) \triangleq -u(\boldsymbol{\alpha}(t), \boldsymbol{\omega}(t))$

Maximizing the time average expectation of $u(t)$ is equivalent
to minimizing the time average expectation of $p_0(t)$. For each
$k \in \{0, 1, \ldots, K\}$ and each slot $t > 0$ define:

    $\bar{p}_k(t) \triangleq \frac{1}{t} \sum_{\tau=0}^{t-1} E[p_k(\tau)]$

The goal is to design a distributed algorithm that solves the
following:

    Minimize:    $\limsup_{t \to \infty} \bar{p}_0(t)$                                         (12)
    Subject to:  $\limsup_{t \to \infty} \bar{p}_k(t) \leq c_k \quad \forall k \in \{1, \ldots, K\}$  (13)
                 Condition (11) holds $\forall t \in \{0, 1, 2, \ldots\}$                      (14)

It is assumed throughout this paper that the constraints (13)-(14)
are feasible. Define $p_0^{opt}$ as the infimum of all limiting
$\bar{p}_0(t)$ values (12) achievable by algorithms that satisfy the
constraints (13)-(14). The infimum is finite because $p_0(t)$ takes
values in the same bounded set for all slots $t$.



C. Optimality via correlated scheduling 
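A pure strategy is defined as a vector-valued function:

    $\boldsymbol{g}(\boldsymbol{\omega}) = (g_1(\omega_1), g_2(\omega_2), \ldots, g_N(\omega_N))$

where $g_i(\omega_i) \in \mathcal{A}_i$ for all $i \in \{1, \ldots, N\}$ and all $\omega_i \in \Omega_i$.
The function $\boldsymbol{g}(\boldsymbol{\omega})$ specifies a distributed decision rule where
each user $i$ chooses $\alpha_i$ as a deterministic function of $\omega_i$.
Specifically, $\alpha_i = g_i(\omega_i)$. The total number of pure strategy
functions $\boldsymbol{g}(\boldsymbol{\omega})$ is $\prod_{i=1}^{N} |\mathcal{A}_i|^{|\Omega_i|}$. Define $M$ as this number, and
enumerate all these vectors by $\boldsymbol{g}^{(m)}(\boldsymbol{\omega})$ for $m \in \{1, \ldots, M\}$.
For each $m \in \{1, \ldots, M\}$ and $k \in \{0, 1, \ldots, K\}$ define:

    $r_k^{(m)} \triangleq \sum_{\boldsymbol{\omega} \in \Omega} \pi(\boldsymbol{\omega}) \, p_k(\boldsymbol{g}^{(m)}(\boldsymbol{\omega}), \boldsymbol{\omega})$    (15)

The value $r_k^{(m)}$ is the expected value of $p_k(t)$ given that users
implement strategy $\boldsymbol{g}^{(m)}(\boldsymbol{\omega})$ on slot $t$.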
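To make the enumeration concrete, the following sketch lists every pure strategy for the two-sensor example of Section II and evaluates the expected penalties (15) with exact fractions. The data layout (each $g_i$ stored as a dictionary from events to actions) and the function names are illustrative choices, not part of the paper.

```python
from fractions import Fraction as F
from itertools import product

OMEGA = [(0, 1), (0, 1)]    # event sets Omega_1, Omega_2
ACTIONS = [(0, 1), (0, 1)]  # action sets A_1, A_2
Q = [F(3, 4), F(1, 2)]      # Pr[w_i = 1]; events are independent here

def pi(w):
    """Probability of the event vector w."""
    prob = F(1)
    for w_i, q_i in zip(w, Q):
        prob *= q_i if w_i == 1 else 1 - q_i
    return prob

def penalties(a, w):
    """(p_0, p_1, p_2) = (-u, a_1, a_2) with u from (5)."""
    u = min(w[0] * a[0] + F(w[1] * a[1], 2), 1)
    return (-u, a[0], a[1])

def pure_strategies():
    """All g = (g_1, ..., g_N); each g_i is a dict from events to actions."""
    per_user = []
    for events, actions in zip(OMEGA, ACTIONS):
        per_user.append([dict(zip(events, choice))
                         for choice in product(actions, repeat=len(events))])
    return list(product(*per_user))

def r(g):
    """Expected penalties (r_0, r_1, r_2) under pure strategy g, as in (15)."""
    totals = [F(0)] * 3
    for w in product(*OMEGA):
        a = tuple(g_i[w_i] for g_i, w_i in zip(g, w))
        for k, p_k in enumerate(penalties(a, w)):
            totals[k] += pi(w) * p_k
    return tuple(totals)

strategies = pure_strategies()
report_iff_event = ({0: 0, 1: 1}, {0: 0, 1: 1})
print(len(strategies), r(report_iff_event))
```

For this example there are 16 pure strategies, and the strategy in which both sensors report exactly when they observe an event has expected penalties (-13/16, 3/4, 1/2).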

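Consider a randomized algorithm that, every slot $t$, independently
uses strategy $\boldsymbol{g}^{(m)}(\boldsymbol{\omega})$ with probability $\theta_m$. For each
$k \in \{0, 1, \ldots, K\}$, the expected penalty $E[p_k(t)]$ under such
a strategy is:

    $E[p_k(t)] = \sum_{m=1}^{M} \theta_m E[p_k(\boldsymbol{g}^{(m)}(\boldsymbol{\omega}(t)), \boldsymbol{\omega}(t))]$
              $= \sum_{m=1}^{M} \theta_m r_k^{(m)}$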



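The following linear program optimizes over the $\theta_m$ probabilities
for this specific algorithm structure:

    Minimize:    $\sum_{m=1}^{M} \theta_m r_0^{(m)}$                                          (16)
    Subject to:  $\sum_{m=1}^{M} \theta_m r_k^{(m)} \leq c_k \quad \forall k \in \{1, \ldots, K\}$  (17)
                 $\theta_m \geq 0 \quad \forall m \in \{1, \ldots, M\}$                       (18)
                 $\sum_{m=1}^{M} \theta_m = 1$                                                (19)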



The objective (16) corresponds to minimizing $E[p_0(t)]$, the
constraints (17) ensure $E[p_k(t)] \leq c_k$ for $k \in \{1, \ldots, K\}$,
and the constraints (18)-(19) ensure the $\theta_m$ values form a
valid probability mass function. Such a randomized algorithm
does not use the history $\mathcal{H}(t)$. The next theorem shows this
algorithm structure is optimal.

Theorem 1: Suppose the problem (12)-(14) is feasible.
Then the linear program (16)-(19) is feasible, and the optimal
objective value (16) is equal to $p_0^{opt}$. Furthermore, there exist
probabilities $(\theta_1, \ldots, \theta_M)$ that solve the linear program and
satisfy $\theta_m > 0$ for at most $K + 1$ values of $m \in \{1, \ldots, M\}$.
Proof: See Appendix A. ■
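For a small instance, the linear program can be solved exactly by enumerating basic solutions, which also makes the "at most $K + 1$ non-zero probabilities" structure visible. The sketch below is not the paper's method: it is a brute-force vertex enumeration with one slack variable per penalty constraint, practical only for tiny problems. The table `R` holds $(r_0^{(m)}, r_1^{(m)}, r_2^{(m)})$ for four pure strategies of the Section II example, computed from (15); all names are hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations

R = [
    (F(0), F(0), F(0)),              # both sensors always idle
    (F(-3, 4), F(3, 4), F(0)),       # strategy 1 of Section II
    (F(-1, 4), F(0), F(1, 2)),       # strategy 2
    (F(-13, 16), F(3, 4), F(1, 2)),  # strategy 3
]
C = (F(1, 3), F(1, 3))  # constraint levels c_1, c_2

def solve(A, b):
    """Gauss-Jordan elimination over exact fractions; None if singular."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                M[r] = [vr - M[r][col] * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def solve_lp(R, C):
    """Minimize sum theta_m r_0 subject to (17)-(19) by checking every basis
    of the equality form (one slack per constraint plus the sum-to-one row)."""
    K, M = len(C), len(R)
    columns = [list(r[1:]) + [F(1)] for r in R]
    columns += [[F(int(j == k)) for j in range(K)] + [F(0)] for k in range(K)]
    costs = [r[0] for r in R] + [F(0)] * K
    rhs = list(C) + [F(1)]
    best = None
    for basis in combinations(range(M + K), K + 1):
        A = [[columns[j][row] for j in basis] for row in range(K + 1)]
        x = solve(A, rhs)
        if x is None or min(x) < 0:
            continue  # singular basis or infeasible basic solution
        value = sum(costs[j] * xj for j, xj in zip(basis, x))
        if best is None or value < best[0]:
            theta = [F(0)] * M
            for j, xj in zip(basis, x):
                if j < M:
                    theta[j] = xj
            best = (value, theta)
    return best

value, theta = solve_lp(R, C)
print(value, theta)
```

Each basic solution has at most $K + 1 = 3$ non-zero entries, so the minimizer returned here mixes at most three pure strategies; for this data it recovers the probabilities (1/3, 5/9, 1/9) reported in Section II.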



IV. Reduced complexity

The linear program (16)-(19) uses variables $(\theta_1, \theta_2, \ldots, \theta_M)$,
where $M$ is the number of pure strategies:

    $M = \prod_{i=1}^{N} |\mathcal{A}_i|^{|\Omega_i|}$



The 2-user sensor network example from Section II has
$|\mathcal{A}_i| = |\Omega_i| = 2$ for $i \in \{1, 2\}$, for a total of $2^2 = 4$
strategy functions $g_i(\omega_i)$ for each user, and hence a total of
$M = 16$ functions $\boldsymbol{g}(\boldsymbol{\omega})$. However, for each user $i$, the two
strategy functions $g_i(\omega_i)$ that give $g_i(0) = 1$ can be removed
from consideration (as it is useless for user $i$ to report if it
observes no event). Thus, the effective number of strategy
functions $g_i(\omega_i)$ for each user is only two, leaving only four
functions $\boldsymbol{g}(\boldsymbol{\omega}) = (g_1(\omega_1), g_2(\omega_2))$. The optimal probabilities
for switching between these four are given in Section II-B,
where it is seen that only $K + 1 = 3$ strategies have non-zero
probabilities.

For general problems, the value of $M$ can be very large. The
remainder of this section shows that, if certain conditions hold,
the set of strategy functions can be pruned to a smaller set
without loss of optimality. For example, consider a two-user
problem with binary actions, so that $|\mathcal{A}_i| = 2$ for $i \in \{1, 2\}$.
Then:

    $M = 2^{|\Omega_1| + |\Omega_2|}$

If certain conditions hold, strategies can be restricted to a set
of size $\tilde{M}$, where:

    $\tilde{M} = (|\Omega_1| + 1)(|\Omega_2| + 1)$

Thus, an exponentially large set is pruned to a smaller set with
polynomial size.

A. The preferred action property 

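Suppose the sets $\mathcal{A}_i$ and $\Omega_i$ for each user $i \in \{1, \ldots, N\}$
are given by:

    $\mathcal{A}_i = \{0, 1, \ldots, |\mathcal{A}_i| - 1\}$    (20)
    $\Omega_i = \{0, 1, \ldots, |\Omega_i| - 1\}$              (21)

For notational convenience, for each $i \in \{1, \ldots, N\}$ let
$[\boldsymbol{\alpha}_{\bar{i}}, \alpha_i]$ denote the $N$-dimensional vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_N)$,
where $\boldsymbol{\alpha}_{\bar{i}}$ is the $(N-1)$-dimensional vector of $\alpha_j$ components
for $j \neq i$. This notation facilitates comparison of two vectors
that differ in just one coordinate. Define $\mathcal{A}_{\bar{i}}$ and $\Omega_{\bar{i}}$ as the
sets of all possible $(N-1)$-dimensional vectors $\boldsymbol{\alpha}_{\bar{i}}$ and $\boldsymbol{\omega}_{\bar{i}}$,
respectively.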

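Definition 1: A penalty function $p(\boldsymbol{\alpha}, \boldsymbol{\omega})$ has the preferred
action property if for all $i \in \{1, \ldots, N\}$, all $\boldsymbol{\alpha}_{\bar{i}} \in \mathcal{A}_{\bar{i}}$, and all
$\boldsymbol{\omega}_{\bar{i}} \in \Omega_{\bar{i}}$, one has:

    $p([\boldsymbol{\alpha}_{\bar{i}}, \alpha], [\boldsymbol{\omega}_{\bar{i}}, \omega]) - p([\boldsymbol{\alpha}_{\bar{i}}, \beta], [\boldsymbol{\omega}_{\bar{i}}, \omega])$
        $\geq p([\boldsymbol{\alpha}_{\bar{i}}, \alpha], [\boldsymbol{\omega}_{\bar{i}}, \gamma]) - p([\boldsymbol{\alpha}_{\bar{i}}, \beta], [\boldsymbol{\omega}_{\bar{i}}, \gamma])$

whenever $\alpha, \beta$ are values in $\mathcal{A}_i$ that satisfy $\alpha > \beta$, and $\omega, \gamma$
are values in $\Omega_i$ that satisfy $\omega < \gamma$.

Intuitively, the above definition means that if user $i$ compares
the difference in penalty under the actions $\alpha_i(t) = \alpha$ and
$\alpha_i(t) = \beta$ (where $\alpha > \beta$), this difference is non-increasing in
the user $i$ observation $\omega_i(t)$ (assuming all other actions and
events $\alpha_j$ and $\omega_j$ are held fixed).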
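Definition 1 can be tested by brute force on small problems. The sketch below is an illustrative checker (the function name and argument layout are assumptions): it loops over every user, every pair of actions with the larger one first, and every pair of events with the smaller one first, and compares the two penalty differences. It is applied to the penalty $-u$ built from the utility (5) of Section II and, as a contrast, to $+u$.

```python
from fractions import Fraction as F
from itertools import product

def has_preferred_action_property(p, action_sizes, event_sizes):
    """Check Definition 1 for p(alpha, omega), where user i has actions
    {0, ..., action_sizes[i]-1} and events {0, ..., event_sizes[i]-1}."""
    n = len(action_sizes)
    for alpha_vec in product(*(range(s) for s in action_sizes)):
        for omega_vec in product(*(range(s) for s in event_sizes)):
            for i in range(n):
                alpha, omega = alpha_vec[i], omega_vec[i]

                def val(a_i, w_i):
                    # Penalty with only user i's action and event replaced.
                    a = alpha_vec[:i] + (a_i,) + alpha_vec[i + 1:]
                    w = omega_vec[:i] + (w_i,) + omega_vec[i + 1:]
                    return p(a, w)

                for beta in range(alpha):                          # beta < alpha
                    for gamma in range(omega + 1, event_sizes[i]):  # gamma > omega
                        lhs = val(alpha, omega) - val(beta, omega)
                        rhs = val(alpha, gamma) - val(beta, gamma)
                        if lhs < rhs:
                            return False
    return True

def u(a, w):
    """Utility (5) of the two-sensor example."""
    return min(w[0] * a[0] + F(w[1] * a[1], 2), 1)

neg_u_ok = has_preferred_action_property(lambda a, w: -u(a, w), (2, 2), (2, 2))
pos_u_ok = has_preferred_action_property(u, (2, 2), (2, 2))
power_ok = has_preferred_action_property(lambda a, w: a[0], (2, 2), (2, 2))
```

Here `neg_u_ok` and `power_ok` are true while `pos_u_ok` is false: reporting becomes more attractive (the penalty difference shrinks) as the observed event value grows, but not the other way around.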

For example, any function $p(\boldsymbol{\alpha}, \boldsymbol{\omega})$ that does not depend on
$\boldsymbol{\omega}$ trivially satisfies the preferred action property. This is the
case for the $p_1(\cdot)$ and $p_2(\cdot)$ functions in (6) used to represent
power expenditures for the sensor network example of Section
II. Further, the utility function (5) in that example yields
$p_0(\cdot) = -u(\cdot)$ that satisfies the preferred action property, as
shown by the next lemma.

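Lemma 1: Suppose $\mathcal{A}_i = \{0, 1\}$ for $i \in \{1, \ldots, N\}$, $\Omega_i$ is
given by (21), and define:

    $u(\boldsymbol{\alpha}, \boldsymbol{\omega}) = \min\left[ \sum_{i=1}^{N} \phi_i(\omega_i) \alpha_i, \; b \right]$

for some (real-valued) constant $b$ and some (real-valued)
non-decreasing functions $\phi_i(\omega_i)$. Then the penalty function
$p_0(\boldsymbol{\alpha}, \boldsymbol{\omega}) = -u(\boldsymbol{\alpha}, \boldsymbol{\omega})$ has the preferred action property.

Lemma 2: Suppose $\mathcal{A}_i = \{0, 1\}$ for $i \in \{1, \ldots, N\}$, $\Omega_i$
is given by (21), and define the utility function $u(\boldsymbol{\alpha}, \boldsymbol{\omega})$
according to the multi-access example equation (4). Then
the penalty function $p_0(\boldsymbol{\alpha}, \boldsymbol{\omega}) = -u(\boldsymbol{\alpha}, \boldsymbol{\omega})$ has the preferred
action property.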



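Lemma 3: Suppose $\mathcal{A}_i$ and $\Omega_i$ are given by (20)-(21).
Define $p(\boldsymbol{\alpha}, \boldsymbol{\omega})$ by:

    $p(\boldsymbol{\alpha}, \boldsymbol{\omega}) = \prod_{i=1}^{N} \phi_i(\omega_i) \psi_i(\alpha_i)$

where $\phi_i(\omega_i)$, $\psi_i(\alpha_i)$ are non-negative functions for all $i \in
\{1, \ldots, N\}$. Suppose that for each $i \in \{1, \ldots, N\}$, $\phi_i(\omega_i)$ is
non-increasing in $\omega_i$ and $\psi_i(\alpha_i)$ is non-decreasing in $\alpha_i$. Then
$p(\boldsymbol{\alpha}, \boldsymbol{\omega})$ has the preferred action property.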

Lemma 4: Suppose $\mathcal{A}_i$ and $\Omega_i$ are given by (20)-(21). Suppose
$p_1(\boldsymbol{\alpha}, \boldsymbol{\omega}), \ldots, p_R(\boldsymbol{\alpha}, \boldsymbol{\omega})$ are a collection of functions that
have the preferred action property (where $R$ is a given positive
integer). Then for any non-negative weights $w_1, \ldots, w_R$, the
following function has the preferred action property:

    $p(\boldsymbol{\alpha}, \boldsymbol{\omega}) = \sum_{r=1}^{R} w_r p_r(\boldsymbol{\alpha}, \boldsymbol{\omega})$

The proofs of Lemmas 1-4 are given in Appendix C.

B. Independent events and reduced complexity 

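Consider the special case when the components of $\boldsymbol{\omega}(t) =
(\omega_1(t), \ldots, \omega_N(t))$ are mutually independent, so that:

    $\pi(\boldsymbol{\omega}) = \prod_{i=1}^{N} q_i(\omega_i)$    (22)

where:

    $q_i(\omega_i) \triangleq \Pr[\omega_i(t) = \omega_i]$

Without loss of generality, assume $q_i(\omega_i) > 0$ for all $i \in
\{1, \ldots, N\}$ and all $\omega_i \in \Omega_i$. Recall that a pure strategy $\boldsymbol{g}(\boldsymbol{\omega})$
is composed of individual strategy functions $g_i(\omega_i)$ for each
user $i$:

    $\boldsymbol{g}(\boldsymbol{\omega}) = (g_1(\omega_1), \ldots, g_N(\omega_N))$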

Theorem 2: (Non-decreasing strategy functions) Suppose
$\mathcal{A}_i$ and $\Omega_i$ are given by (20)-(21). If all penalty functions
$p_k(\boldsymbol{\alpha}, \boldsymbol{\omega})$ for $k \in \{0, 1, \ldots, K\}$ have the preferred action
property, and if the random event process $\boldsymbol{\omega}(t)$ satisfies the
independence property (22), then it suffices to restrict attention
to strategy functions $g_i(\omega_i)$ that are non-decreasing in $\omega_i$.

Proof: The proof uses an interchange argument. Fix
$m \in \{1, \ldots, M\}$, $i \in \{1, \ldots, N\}$, and fix two elements
$\omega$ and $\gamma$ in $\Omega_i$ that satisfy $\omega < \gamma$. Suppose the linear
program (16)-(19) places weight $\theta_m > 0$ on a strategy
function $\boldsymbol{g}^{(m)}(\boldsymbol{\omega})$ that satisfies $g_i^{(m)}(\omega) > g_i^{(m)}(\gamma)$ (so the
non-decreasing requirement is violated). The goal is to show
this can be replaced by new strategies that do not violate the
non-decreasing requirement for elements $\omega$ and $\gamma$, without loss
of optimality.

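Define $\alpha = g_i^{(m)}(\omega)$ and $\beta = g_i^{(m)}(\gamma)$. Then $\alpha > \beta$. Define
two new functions:

    $g_i^{(m),low}(\omega_i) = \begin{cases} g_i^{(m)}(\omega_i) & \text{if } \omega_i \notin \{\omega, \gamma\} \\ \beta & \text{if } \omega_i \in \{\omega, \gamma\} \end{cases}$

    $g_i^{(m),high}(\omega_i) = \begin{cases} g_i^{(m)}(\omega_i) & \text{if } \omega_i \notin \{\omega, \gamma\} \\ \alpha & \text{if } \omega_i \in \{\omega, \gamma\} \end{cases}$

Unlike the original function $g_i^{(m)}(\omega_i)$, these new functions
satisfy:

    $g_i^{(m),low}(\omega) \leq g_i^{(m),low}(\gamma)$
    $g_i^{(m),high}(\omega) \leq g_i^{(m),high}(\gamma)$

Define $\boldsymbol{g}^{(m),low}(\boldsymbol{\omega})$ and $\boldsymbol{g}^{(m),high}(\boldsymbol{\omega})$ by replacing the $i$th
component function $g_i^{(m)}(\omega_i)$ of $\boldsymbol{g}^{(m)}(\boldsymbol{\omega})$ with new component
functions $g_i^{(m),low}(\omega_i)$ and $g_i^{(m),high}(\omega_i)$, respectively.
Let $p_k^{old}(t)$ be the $k$th penalty incurred in the (old) strategy
that uses $\boldsymbol{g}^{(m)}(\boldsymbol{\omega})$ with probability $\theta_m$. Let $p_k^{new}(t)$ be the
corresponding penalty under a (new) strategy that, instead of
using $\boldsymbol{g}^{(m)}(\boldsymbol{\omega})$ with probability $\theta_m$, uses:

• $\boldsymbol{g}^{(m),low}(\boldsymbol{\omega})$ with probability $\theta_m q_i(\gamma) / (q_i(\omega) + q_i(\gamma))$.

• $\boldsymbol{g}^{(m),high}(\boldsymbol{\omega})$ with probability $\theta_m q_i(\omega) / (q_i(\omega) + q_i(\gamma))$.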

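Let $\boldsymbol{\omega}_{\bar{i}}(t)$ denote the $(N-1)$-dimensional vector of components
$\omega_j(t)$ for $j \neq i$. Fix any vector $\boldsymbol{\omega}_{\bar{i}} \in \Omega_{\bar{i}}$. Define $\boldsymbol{\alpha}_{\bar{i}}$
as the corresponding $(N-1)$-dimensional vector of $g_j^{(m)}(\omega_j)$
values for $j \neq i$. Then:

• If $\boldsymbol{\omega}_{\bar{i}}(t) = \boldsymbol{\omega}_{\bar{i}}$, $\omega_i(t) = \omega$, and $\boldsymbol{g}^{(m),low}(\boldsymbol{\omega})$ is used by
  the new strategy, then $\boldsymbol{\omega}(t) = [\boldsymbol{\omega}_{\bar{i}}, \omega]$ and:

    $p_k^{new}(t) = p_k(\boldsymbol{g}^{(m),low}([\boldsymbol{\omega}_{\bar{i}}, \omega]), [\boldsymbol{\omega}_{\bar{i}}, \omega])$
                 $= p_k([\boldsymbol{\alpha}_{\bar{i}}, \beta], [\boldsymbol{\omega}_{\bar{i}}, \omega])$

  Further, since the old strategy used $g_i^{(m)}(\omega) = \alpha$:

    $p_k^{old}(t) = p_k(\boldsymbol{g}^{(m)}([\boldsymbol{\omega}_{\bar{i}}, \omega]), [\boldsymbol{\omega}_{\bar{i}}, \omega])$
                 $= p_k([\boldsymbol{\alpha}_{\bar{i}}, \alpha], [\boldsymbol{\omega}_{\bar{i}}, \omega])$

• If $\boldsymbol{\omega}_{\bar{i}}(t) = \boldsymbol{\omega}_{\bar{i}}$, $\omega_i(t) = \gamma$, and $\boldsymbol{g}^{(m),high}(\boldsymbol{\omega})$ is used by
  the new strategy, then $\boldsymbol{\omega}(t) = [\boldsymbol{\omega}_{\bar{i}}, \gamma]$ and:

    $p_k^{new}(t) = p_k(\boldsymbol{g}^{(m),high}([\boldsymbol{\omega}_{\bar{i}}, \gamma]), [\boldsymbol{\omega}_{\bar{i}}, \gamma])$
                 $= p_k([\boldsymbol{\alpha}_{\bar{i}}, \alpha], [\boldsymbol{\omega}_{\bar{i}}, \gamma])$

  Further, since the old strategy used $g_i^{(m)}(\gamma) = \beta$:

    $p_k^{old}(t) = p_k(\boldsymbol{g}^{(m)}([\boldsymbol{\omega}_{\bar{i}}, \gamma]), [\boldsymbol{\omega}_{\bar{i}}, \gamma])$
                 $= p_k([\boldsymbol{\alpha}_{\bar{i}}, \beta], [\boldsymbol{\omega}_{\bar{i}}, \gamma])$

• Suppose $\boldsymbol{\omega}_{\bar{i}}(t) = \boldsymbol{\omega}_{\bar{i}}$, but neither of the above two events
  is satisfied on slot $t$. That is, neither of the events $\mathcal{E}_1$ or
  $\mathcal{E}_2$ is true, where:

    $\mathcal{E}_1 \triangleq \{\omega_i(t) = \omega\} \cap \{\boldsymbol{g}^{(m),low}(\boldsymbol{\omega}) \text{ is used}\}$
    $\mathcal{E}_2 \triangleq \{\omega_i(t) = \gamma\} \cap \{\boldsymbol{g}^{(m),high}(\boldsymbol{\omega}) \text{ is used}\}$

  Then $p_k^{new}(t) - p_k^{old}(t) = 0$.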
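It follows that:

    $E[p_k^{new}(t) - p_k^{old}(t) \mid \boldsymbol{\omega}_{\bar{i}}(t) = \boldsymbol{\omega}_{\bar{i}}]$
      $= \frac{\theta_m q_i(\omega) q_i(\gamma)}{q_i(\omega) + q_i(\gamma)} \left[ p_k([\boldsymbol{\alpha}_{\bar{i}}, \alpha], [\boldsymbol{\omega}_{\bar{i}}, \gamma]) - p_k([\boldsymbol{\alpha}_{\bar{i}}, \beta], [\boldsymbol{\omega}_{\bar{i}}, \gamma]) \right]$    (23)
      $+ \frac{\theta_m q_i(\omega) q_i(\gamma)}{q_i(\omega) + q_i(\gamma)} \left[ p_k([\boldsymbol{\alpha}_{\bar{i}}, \beta], [\boldsymbol{\omega}_{\bar{i}}, \omega]) - p_k([\boldsymbol{\alpha}_{\bar{i}}, \alpha], [\boldsymbol{\omega}_{\bar{i}}, \omega]) \right]$

where the above uses the fact that $\omega_i(t)$ is independent of
$\boldsymbol{\omega}_{\bar{i}}(t)$, so conditioning on $\boldsymbol{\omega}_{\bar{i}}(t) = \boldsymbol{\omega}_{\bar{i}}$ does not change the
distribution of $\omega_i(t)$. Because $p_k(\cdot)$ satisfies the preferred
action property and $\alpha > \beta$, $\omega < \gamma$, one has:

    $\left[ p_k([\boldsymbol{\alpha}_{\bar{i}}, \alpha], [\boldsymbol{\omega}_{\bar{i}}, \omega]) - p_k([\boldsymbol{\alpha}_{\bar{i}}, \beta], [\boldsymbol{\omega}_{\bar{i}}, \omega]) \right]$
      $\geq \left[ p_k([\boldsymbol{\alpha}_{\bar{i}}, \alpha], [\boldsymbol{\omega}_{\bar{i}}, \gamma]) - p_k([\boldsymbol{\alpha}_{\bar{i}}, \beta], [\boldsymbol{\omega}_{\bar{i}}, \gamma]) \right]$

and hence (23) is less than or equal to zero. This holds when
conditioning on all possible values of $\boldsymbol{\omega}_{\bar{i}}(t)$, and so:

    $E[p_k^{new}(t) - p_k^{old}(t)] \leq 0$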

This holds for all penalties $k \in \{0, 1, \ldots, K\}$, and so the
modified algorithm still satisfies all constraints with an optimal
value for $E[p_0(t)]$. The interchange can be repeated a finite
number of times until all strategy functions are non-decreasing. ■

In the special case of binary actions, so that $\mathcal{A}_i = \{0, 1\}$ for all $i \in \{1, \ldots, N\}$, all non-decreasing strategy functions $g_i(\omega_i)$ have the following form:

$$g_i(\omega_i) = \begin{cases} 0 & \text{if } \omega_i < h^* \\ 1 & \text{if } \omega_i \geq h^* \end{cases} \tag{24}$$

for some threshold $h^* \in \{0, 1, \ldots, |\Omega_i|\}$. There are $|\Omega_i| + 1$ such threshold functions, whereas the total number of strategy functions for user $i$ is $2^{|\Omega_i|}$. Restricting to the threshold functions therefore reduces the per-user count from exponential to linear in $|\Omega_i|$.
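The count above can be checked by brute force. The following is a minimal sketch, assuming the event set is $\{0, 1, \ldots, |\Omega_i| - 1\}$ and representing a strategy function as a tuple of actions indexed by the event value; the helper names are illustrative and do not come from the paper.

```python
from itertools import product

def threshold_functions(num_events):
    """All functions of the form (24): g[w] = 0 if w < h, 1 if w >= h,
    for thresholds h in {0, ..., num_events}."""
    return [tuple(1 if w >= h else 0 for w in range(num_events))
            for h in range(num_events + 1)]

def all_binary_functions(num_events):
    """Every binary strategy function on {0, ..., num_events - 1}."""
    return list(product((0, 1), repeat=num_events))

def is_non_decreasing(g):
    """True if g[w] never decreases as w increases."""
    return all(a <= b for a, b in zip(g, g[1:]))

# Illustrative event-set size (hypothetical choice for the demonstration).
num_events = 10
thresholds = threshold_functions(num_events)
monotone = [g for g in all_binary_functions(num_events) if is_non_decreasing(g)]
```

For a 10 element event set, the non-decreasing functions found by exhaustive search coincide with the threshold functions, which is the reduction from $2^{10}$ candidates to 11.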

V. Online optimization 

This section presents a dynamic algorithm to solve the problem (12)-(14). The algorithm can also be viewed as an online solution to the linear program (16)-(19). Let $\tilde{M}$ be the number of pure strategies required for consideration in the linear program (where $\tilde{M}$ is possibly smaller than $M$, as discussed in the previous section). Reorder the functions $g^{(m)}(\omega)$ if necessary so that every slot $t$, the system chooses a strategy function in the set $\{g^{(1)}(\omega), \ldots, g^{(\tilde{M})}(\omega)\}$.

Suppose all users receive feedback specifying the values of the penalties $p_1(t), \ldots, p_K(t)$ at the end of slot $t + D$, where $D$ is a non-negative integer that represents a system delay. For each constraint $k \in \{1, \ldots, K\}$, define a virtual queue $Q_k(t)$ and initialize $Q_k(0)$ to a commonly known value (typically 0). For each $t \in \{0, 1, 2, \ldots\}$ the queue is updated by:

$$Q_k(t+1) = \max[Q_k(t) + p_k(t - D) - c_k, 0] \tag{25}$$

Each user can iterate the above equation based on information available at the end of slot $t$. Thus, all users know the value of $Q_k(t)$ at the beginning of each slot $t$. If $D > 0$, define $p_k(-1) = p_k(-2) = \ldots = p_k(-D) = 0$.
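The update (25) can be sketched for a single constraint as follows. This is a minimal illustration, assuming the penalty sequence is given as a list indexed by slot and that penalties at negative slot indices are treated as 0, as stated above; the function names and the numbers in the example are hypothetical.

```python
def queue_update(q, delayed_penalty, c):
    """One step of (25): Q(t+1) = max[Q(t) + p(t - D) - c, 0]."""
    return max(q + delayed_penalty - c, 0.0)

def run_queue(penalties, c, delay, q0=0.0):
    """Iterate (25) for t = 0, ..., len(penalties) - 1.

    Returns the list [Q(0), Q(1), ..., Q(T)]. Penalties with a
    negative slot index are taken to be 0.
    """
    history = [q0]
    for t in range(len(penalties)):
        delayed = penalties[t - delay] if t - delay >= 0 else 0.0
        history.append(queue_update(history[-1], delayed, c))
    return history

# Illustrative data: one constraint with c = 0.5 and delay D = 2.
example = run_queue([1, 0, 1, 1, 0, 1], c=0.5, delay=2)
```

The queue grows on slots where the delayed penalty exceeds the constraint level and drains otherwise, so a large queue value signals that the constraint has recently been violated on average.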

Lemma 5: Under any decision rule for choosing strategy functions over time, for all $t > 0$ one has:

$$\frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}[p_k(\tau - D)] \leq c_k + \frac{\mathbb{E}[Q_k(t)]}{t} - \frac{\mathbb{E}[Q_k(0)]}{t}$$

Proof: From (25) the following holds for all slots $\tau \in \{0, 1, 2, \ldots\}$:

$$Q_k(\tau + 1) \geq Q_k(\tau) + p_k(\tau - D) - c_k$$

Thus:

$$Q_k(\tau + 1) - Q_k(\tau) \geq p_k(\tau - D) - c_k$$

Summing over $\tau \in \{0, 1, \ldots, t - 1\}$ for $t > 0$ gives:

$$Q_k(t) - Q_k(0) \geq \sum_{\tau=0}^{t-1} p_k(\tau - D) - c_k t$$

Taking expectations, dividing by $t$, and rearranging terms proves the result. ■

Lemma 5 ensures the constraints (13) are satisfied whenever the condition $\lim_{t\to\infty} \mathbb{E}[Q_k(t)]/t = 0$ holds for all $k \in \{1, \ldots, K\}$, a condition called mean rate stability [1].

A. Lyapunov optimization 

Define $Q(t) = (Q_1(t), \ldots, Q_K(t))$. Define $L(t)$ as the squared norm of $Q(t)$ (divided by 2 for convenience later):

$$L(t) \triangleq \frac{1}{2}\|Q(t)\|^2 = \frac{1}{2}\sum_{k=1}^{K} Q_k(t)^2$$

Define $\Delta(t) \triangleq L(t + 1) - L(t)$, called the Lyapunov drift. Consider the following structure for the control decisions: Every slot $t$ the queues $Q(t)$ are observed. Then a collection of non-negative values $\beta_m(t)$ are created that satisfy $\sum_{m=1}^{\tilde{M}} \beta_m(t) = 1$ (if desired, the $\beta_m(t)$ values can be chosen as a function of the $Q(t)$ values). Then an index $m \in \{1, \ldots, \tilde{M}\}$ is randomly and independently chosen according to the probability mass function $\beta_m(t)$, and the decision rule $g^{(m)}(\omega(t))$ is used for slot $t$. Thus, a specific algorithm with this structure is determined by specifying how the $\beta_m(t)$ probabilities are chosen on each slot $t$.

Motivated by the theory in [1], the approach is to choose probabilities every slot to greedily minimize a bound on the drift-plus-penalty expression $\mathbb{E}[\Delta(t + D) + V p_0(t) \mid Q(t)]$, where $V$ is a non-negative weight that affects a performance tradeoff. The $D$-shifted drift term $\Delta(t + D)$ is different from [1] and is used because of the delayed feedback structure of the queue update (25). The intuition is that minimizing $\Delta(t + D)$ maintains queue stability, while adding the weighted penalty term $V p_0(t)$ biases decisions in favor of lower penalties. The following lemma provides a bound on the drift-plus-penalty expression under any $\beta_m(t)$ probabilities.

Lemma 6: Fix $V \geq 0$. Under the above decision structure, one has for slot $t$:

$$\mathbb{E}[\Delta(t + D) + V p_0(t) \mid Q(t)] \leq B(1 + 2D) + V\sum_{m=1}^{\tilde{M}} \beta_m(t) r_0^{(m)} + \sum_{k=1}^{K} Q_k(t)\left[\sum_{m=1}^{\tilde{M}} \beta_m(t) r_k^{(m)} - c_k\right] \tag{26}$$

where $r_k^{(m)}$ is the $k$th component of $r^{(m)}$ as defined in (15), and the constant $B$ is defined:

$$B \triangleq \frac{1}{2}\sum_{k=1}^{K} \max_{\alpha \in \mathcal{A},\, \omega \in \Omega} (p_k(\alpha, \omega) - c_k)^2$$

Proof: Note that for all $k \in \{0, 1, \ldots, K\}$:

$$\mathbb{E}[p_k(t) \mid Q(t)] = \mathbb{E}[p_k(\alpha(t), \omega(t)) \mid Q(t)] = \sum_{m=1}^{\tilde{M}} \sum_{\omega \in \Omega} \beta_m(t)\pi(\omega) p_k(g^{(m)}(\omega), \omega) = \sum_{m=1}^{\tilde{M}} \beta_m(t) r_k^{(m)}$$

Therefore, to prove (26) it suffices to prove:

$$\mathbb{E}[\Delta(t + D) \mid Q(t)] \leq B(1 + 2D) + \sum_{k=1}^{K} Q_k(t)\,\mathbb{E}[p_k(t) - c_k \mid Q(t)] \tag{27}$$

To this end, squaring the queue equation (25), using $\max[a, 0]^2 \leq a^2$, and evaluating at time $t + D$ yields:

$$Q_k(t + D + 1)^2 \leq Q_k(t + D)^2 + (p_k(t) - c_k)^2 + 2Q_k(t + D)(p_k(t) - c_k)$$

Summing over $k \in \{1, \ldots, K\}$ and dividing by 2 gives:

$$\Delta(t + D) \leq \frac{1}{2}\sum_{k=1}^{K} (p_k(t) - c_k)^2 + \sum_{k=1}^{K} Q_k(t + D)(p_k(t) - c_k)$$
$$= \frac{1}{2}\sum_{k=1}^{K} (p_k(t) - c_k)^2 + \sum_{k=1}^{K} Q_k(t)(p_k(t) - c_k) + \sum_{k=1}^{K} (Q_k(t + D) - Q_k(t))(p_k(t) - c_k)$$

Taking conditional expectations of the above proves (27) upon application of the following inequalities (see Appendix E):

$$\frac{1}{2}\sum_{k=1}^{K} \mathbb{E}[(p_k(t) - c_k)^2 \mid Q(t)] \leq B$$

$$\sum_{k=1}^{K} \mathbb{E}[(Q_k(t + D) - Q_k(t))(p_k(t) - c_k) \mid Q(t)] \leq 2BD$$

■



B. The drift-plus-penalty algorithm 

Observe that the probability mass function $\beta_m(t)$ that minimizes the right-hand-side of (26) is the one that, with probability 1, chooses the index $m \in \{1, \ldots, \tilde{M}\}$ that minimizes the expression (breaking ties arbitrarily):

$$V r_0^{(m)} + \sum_{k=1}^{K} Q_k(t) r_k^{(m)} \tag{28}$$

This gives rise to the following drift-plus-penalty algorithm: Every slot $t$:

• Users observe the queue vector $Q(t)$.

• Users select the pure decision strategy $g^{(m)}(\omega)$, where $m$ is the index that minimizes the expression (28).

• The delayed penalty information $p_k(t - D)$ is observed and queues are updated via (25).
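The index selection in the second step can be sketched as below. This is a minimal illustration, assuming the $r_k^{(m)}$ values are stored as a list of rows `r[m] = [r_0, r_1, ..., r_K]` with 0-based strategy indices; the function name and the numbers are hypothetical. Ties are broken in favor of the smallest index, which is one admissible arbitrary rule.

```python
def select_strategy(r, queues, V):
    """Return the index m minimizing V * r[m][0] + sum_k Q_k * r[m][k].

    r[m][0] is the objective component of strategy m and r[m][k]
    (k >= 1) is its k-th penalty component; queues[k - 1] holds Q_k(t).
    """
    def score(m):
        return V * r[m][0] + sum(q * r[m][k + 1] for k, q in enumerate(queues))
    return min(range(len(r)), key=score)

# Illustrative values with K = 1: aggressive, moderate, and idle strategies.
r_example = [[-0.5, 0.6], [-0.3, 0.2], [0.0, 0.0]]
choice_empty_queue = select_strategy(r_example, [0.0], V=10)
choice_medium_queue = select_strategy(r_example, [10.0], V=10)
choice_large_queue = select_strategy(r_example, [20.0], V=10)
```

With an empty queue the strategy with the best objective value wins; as the queue grows the selection shifts toward strategies with smaller penalty components, which is how the weights steer the time averages toward the constraints.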



C. Performance Analysis 

Theorem 3: If the problem (12)-(14) is feasible, then under the drift-plus-penalty algorithm for any $V > 0$:

• All desired constraints (13)-(14) are satisfied.

• For all $t > 0$, the time average expectation of $p_0(t)$ satisfies:

$$\frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}[p_0(\tau)] \leq p_0^{opt} + \frac{B(1 + 2D)}{V} + \frac{\mathbb{E}[L(D)]}{Vt} \tag{29}$$

• For all $t > 0$, the time average expectation of $p_k(t)$ satisfies the following for all $k \in \{1, \ldots, K\}$:

$$\frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}[p_k(\tau)] \leq c_k + O(\sqrt{V/t}) \tag{30}$$

The above theorem shows the time average expectation of $p_0(t)$ is within $O(1/V)$ of optimality. It can be pushed as close to optimal as desired by increasing the $V$ parameter. The tradeoff is in the amount of time required for the time average expected penalties to be close to their desired constraints. It can be shown that if $D = 0$ and a mild Slater condition is satisfied, then the bound (30) can be improved to (see Appendix D):

$$\frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}[p_k(\tau)] \leq c_k + O(V/t) + O(\log(t)/t) \tag{31}$$


Proof: (Theorem 3) Every slot $\tau \in \{0, 1, 2, \ldots\}$ the drift-plus-penalty algorithm chooses probabilities $\beta_m(\tau)$ that minimize the right-hand-side of the expression (26). Thus:

$$\mathbb{E}[\Delta(\tau + D) + V p_0(\tau) \mid Q(\tau)] \leq B(1 + 2D) + V\sum_{m=1}^{\tilde{M}} \theta_m r_0^{(m)} + \sum_{k=1}^{K} Q_k(\tau)\left[\sum_{m=1}^{\tilde{M}} \theta_m r_k^{(m)} - c_k\right]$$

where $\theta_m$ is any alternative probability mass function defined over $m \in \{1, \ldots, \tilde{M}\}$. Using the probabilities $\theta_m$ that optimally solve the linear program (16)-(19), so that the objective term equals $p_0^{opt}$ and each bracketed term is at most zero, gives:

$$\mathbb{E}[\Delta(\tau + D) + V p_0(\tau) \mid Q(\tau)] \leq B(1 + 2D) + V p_0^{opt}$$

Taking expectations of both sides and using iterated expectations gives:

$$\mathbb{E}[\Delta(\tau + D)] + V\,\mathbb{E}[p_0(\tau)] \leq B(1 + 2D) + V p_0^{opt}$$

Summing over $\tau \in \{0, 1, \ldots, t - 1\}$ gives:

$$\mathbb{E}[L(t + D)] - \mathbb{E}[L(D)] + V\sum_{\tau=0}^{t-1} \mathbb{E}[p_0(\tau)] \leq B(1 + 2D)t + V p_0^{opt} t \tag{32}$$

Using the fact that $\mathbb{E}[L(t + D)] \geq 0$ and rearranging terms proves (29).

Again rearranging (32) yields:

$$\mathbb{E}[L(t + D)] \leq (C + FV)t \tag{33}$$

where $C$ is defined:

$$C \triangleq \mathbb{E}[L(D)] + B(1 + 2D)$$

and $F$ is defined as a constant that satisfies the following for all slots $\tau$:

$$F \geq p_0^{opt} - \mathbb{E}[p_0(\tau)]$$

Such a constant exists because $p_0(t)$ has a finite number of possible outcomes. Using the definition of $L(t + D)$ in (33) gives:

$$\mathbb{E}[\|Q(t + D)\|^2] \leq 2(C + FV)t$$

By Jensen's inequality:

$$\mathbb{E}[\|Q(t + D)\|]^2 \leq \mathbb{E}[\|Q(t + D)\|^2]$$

Thus:

$$\mathbb{E}[\|Q(t + D)\|]^2 \leq 2(C + FV)t$$
$$\frac{\mathbb{E}[\|Q(t + D)\|]}{t} \leq \sqrt{\frac{2(C + FV)}{t}}$$

Using this with Lemma 5 proves (30). The inequality (30) immediately implies that all desired constraints are satisfied. ■



D. The approximate drift-plus-penalty algorithm 

The algorithm of Section V-B assumes perfect knowledge of the $r_k^{(m)}$ values. These can be computed by (15) if the event probabilities $\pi(\omega)$ are known. Suppose these probabilities are unknown, but delayed samples $\omega(t - D)$ are available at the end of each slot $t$. Let $W$ be a positive integer that represents a sample size. The $r_k^{(m)}$ values can be approximated by:

$$\hat{r}_k^{(m)}(t) = \frac{1}{W}\sum_{w=0}^{W-1} p_k\left(g^{(m)}(\omega(t - D - w)), \omega(t - D - w)\right)$$

The approximate algorithm uses $\hat{r}_k^{(m)}(t)$ values in place of $r_k^{(m)}$ in the expression (28). Analysis in [20] shows that the performance gap between exact and approximate drift-plus-penalty implementations is $O(1/\sqrt{W})$, so that the approximate algorithm is very close to the exact algorithm when $W$ is large.
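The moving average can be maintained with a fixed-length buffer of the most recent delayed samples. The sketch below is illustrative only: the class name, the representation of events and actions as tuples, and the choice to average over however many samples are available before the window fills are assumptions, not details from the paper.

```python
from collections import deque

class MovingAverageEstimator:
    """Estimate the penalty components of one pure strategy from the
    W most recent observed event vectors."""

    def __init__(self, strategy, penalty_fns, window):
        self.strategy = strategy          # maps an event tuple to an action tuple
        self.penalty_fns = penalty_fns    # list of functions p_k(action, event)
        self.samples = deque(maxlen=window)

    def observe(self, omega):
        """Record a newly available (delayed) event sample."""
        self.samples.append(omega)

    def estimate(self):
        """Return [r_hat_0, ..., r_hat_K] averaged over the stored samples."""
        if not self.samples:
            raise ValueError("no samples observed yet")
        n = len(self.samples)
        return [sum(p(self.strategy(w), w) for w in self.samples) / n
                for p in self.penalty_fns]

# Hypothetical single-user example: report when the event is at least 5,
# with a power penalty equal to the action.
estimator = MovingAverageEstimator(
    strategy=lambda w: (1 if w[0] >= 5 else 0,),
    penalty_fns=[lambda a, w: a[0]],
    window=3,
)
for sample in [(2,), (7,), (9,), (1,)]:
    estimator.observe(sample)
power_estimate = estimator.estimate()[0]
```

One estimator of this kind would be kept per strategy index, and its output substituted for the exact value when evaluating (28).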

E. Separable penalty functions 

A simpler and exact implementation is possible, without requiring knowledge of the probability distribution for $\omega(t)$, when penalty functions have the following separable form for all $k \in \{0, 1, \ldots, K\}$:

$$p_k(\alpha, \omega) = \sum_{i=1}^{N} p_{ik}(\alpha_i, \omega_i) \tag{34}$$

where $p_{ik}(\alpha_i, \omega_i)$ are any functions of $(\alpha_i, \omega_i) \in \mathcal{A}_i \times \Omega_i$. Choosing an $m \in \{1, \ldots, M\}$ that minimizes the expression (28) is equivalent to observing the queues $Q(t)$ and then choosing a strategy function $g(\omega) = (g_1(\omega_1), \ldots, g_N(\omega_N))$ to minimize:

$$\sum_{\omega \in \Omega} \pi(\omega)\left[V p_0(g(\omega), \omega) + \sum_{k=1}^{K} Q_k(t) p_k(g(\omega), \omega)\right]$$

With the structure (34), this expression becomes:

$$\sum_{\omega \in \Omega} \sum_{i=1}^{N} \pi(\omega)\left[V p_{i0}(g_i(\omega_i), \omega_i) + \sum_{k=1}^{K} Q_k(t) p_{ik}(g_i(\omega_i), \omega_i)\right]$$

The above is minimized by the following for each $i \in \{1, \ldots, N\}$:

$$g_i(\omega_i) = \arg\min_{\alpha_i \in \mathcal{A}_i}\left[V p_{i0}(\alpha_i, \omega_i) + \sum_{k=1}^{K} Q_k(t) p_{ik}(\alpha_i, \omega_i)\right]$$

Thus, the minimization step in the drift-plus-penalty algorithm reduces to having each user observe its own $\omega_i(t)$ value and then setting $\alpha_i(t) = g_i(\omega_i(t))$, where the function $g_i(\omega_i)$ is defined above. The queue update (25) is the same as before. In the special case $D = 0$, this is the same algorithm as the optimal (centralized) drift-plus-penalty algorithm of [1]. Hence, for separable problems, there is no optimality gap between centralized and distributed algorithms.
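The per-user minimization can be sketched as follows. This is a minimal illustration, assuming user $i$'s penalty components are supplied as a single function `p_i(k, action, event)` with `k = 0` denoting the objective; the function name and the example penalty model (negative utility $-\alpha\omega/10$ and unit power cost) are hypothetical.

```python
def separable_action(actions, omega_i, p_i, queues, V):
    """Return the action of user i that minimizes
    V * p_i0(a, omega_i) + sum_k Q_k * p_ik(a, omega_i)."""
    def cost(a):
        return V * p_i(0, a, omega_i) + sum(
            q * p_i(k + 1, a, omega_i) for k, q in enumerate(queues))
    return min(actions, key=cost)

def example_penalty(k, a, w):
    """Hypothetical user model: component 0 is negative utility,
    component 1 is the power spent by reporting."""
    return -a * w / 10.0 if k == 0 else float(a)

# With V = 10 and Q_1 = 5 the user reports only for sufficiently large events.
action_for_small_event = separable_action((0, 1), 3, example_penalty, [5.0], V=10)
action_for_large_event = separable_action((0, 1), 8, example_penalty, [5.0], V=10)
```

Each user needs only its own observation and the commonly known queue values, so no enumeration over joint strategy functions is required in the separable case.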

VI. Simulations 

A. Ergodic performance for a 2 user system 

This subsection presents simulation results for the 2 user sensor network example of Section II. The approximate drift-plus-penalty algorithm of Section V-D is used with a delay of $D = 10$ slots and a moving average window size of $W = 40$ slots. The algorithm is not aware of the system probabilities. The objective of this simulation is to find how close the achieved utility is to the optimal value $u^{opt} = 23/48 \approx 0.47917$ computed in Section II-B. Recall that the desired power constraints are $\overline{p}_i \leq 1/3$ for each user $i \in \{1, 2\}$. The table in Fig. 1 presents performance for various values of $V$. For $V \geq 50$ the achieved utility differs from optimality only in the fourth decimal place.

V      utility     power 1     power 2
1      0.344639    0.259764    0.219525
5      0.454557    0.333158    0.267161
10     0.472763    0.333335    0.300415
25     0.478186    0.333346    0.326948
50     0.479032    0.333369    0.332873
100    0.479218    0.333406    0.333334

Fig. 1. Algorithm performance over t = 10^ slots (D = 10, W = 40). Recall that $u^{opt} = 23/48 \approx 0.47917$.



B. Ergodic performance for a 3 user system 

Consider a network of 3 sensors that communicate reports to a fusion center, similar to the example considered in Section II. The event processes $\omega_i(t)$ for each sensor $i \in \{1, 2, 3\}$ take values in the same 10 element set $\Omega$:

$$\Omega \triangleq \{0, 1, 2, 3, \ldots, 9\}$$

Consider binary actions $\alpha_i(t) \in \{0, 1\}$, where $\alpha_i(t) = 1$ corresponds to sensor $i$ sending a report, and incurs a power cost of 1 for that sensor. The penalty and utility functions are:



Pt{a.i,Ui) 



Vie {1,2,3} 

QflCJl OL2LO2 



"3^3 



10 



20 



Thus, sensor 1 brings more utility than the other sensors.

Assume $\omega_1(t), \omega_2(t), \omega_3(t)$ are mutually independent and uniformly distributed over $\Omega$. The requirements for Theorem 2 hold, and so one can restrict attention to the 11 threshold functions $g_i(\omega_i)$ of the type (24). As it does not make sense to report when $\omega_i(t) = 0$, the functions $g_i(\omega) = 1$ for all $\omega$ can be removed. This leaves only 10 threshold functions at each user, for a total of $10^3 = 1000$ strategy functions $g^{(m)}(\omega)$ to be considered every slot. The approximate drift-plus-penalty algorithm of Section V-D is simulated over t = 10^ slots with a delay $D = 10$ and for various choices of the moving average window size $W$ and the parameter $V$. All average power constraints were met for all choices of $V$ and $W$. The achieved utility is shown in Fig. 2. The utility increases to a limiting value as $V$ is increased. This limiting value can be improved by adjusting the number of samples $W$ used in the moving average. Increasing $W$ from 40 to 200 gives a small improvement in performance. There is only a negligible improvement when $W$ is further increased to 400 (the curves for $W = 200$ and $W = 400$ look identical).



[Figure: time average utility versus V]

Fig. 2. Achieved utility $\overline{u}$ versus $V$ for various choices of $W$.

Fig. 4 demonstrates how the $V$ parameter affects the rate of convergence to the desired constraints. The window size is fixed to $W = 40$ and the value $\max[\overline{p}_1(t), \overline{p}_2(t), \overline{p}_3(t)]$ is plotted for $t \in \{0, 1, \ldots, 2000\}$ (where $\overline{p}_i(t)$ is the empirical average power expenditure of user $i$ up to slot $t$). This value approaches the desired constraint of 1/3 more slowly when $V$ is large. The following table presents time averages after a longer duration of 10^ slots.



V      utility     power 1     power 2     power 3
1      0.259400    0.258000    0.251310    0.251342
10     0.406263    0.333301    0.316371    0.316418
50     0.464545    0.333357    0.333341    0.333342
100    0.467642    0.333387    0.333354    0.333354

Fig. 3. Time averages after t = 10^ slots (W = 40).

[Figure: average power versus time]

Fig. 4. An illustration of the rate of convergence to the desired constraint 1/3 for various choices of $V$. The curves plot $\max[\overline{p}_1(t), \overline{p}_2(t), \overline{p}_3(t)]$ versus $t$.

[Figure, two panels: utility (averaged over 2000 runs) versus time (W=40); user 1 power (averaged over 2000 runs) versus time (W=40)]

Fig. 5. A sample path of average utility and power versus time. Values at each time slot $t$ are obtained by averaging the actual utility and power used by the algorithm on that slot over 2000 independent simulation runs.



C. Adaptation to non-ergodic changes 

The initial queue state determines the coefficient of an $O(1/t)$ transient in the performance bounds of the system (consider the $\mathbb{E}[L(D)]/(Vt)$ term in (29)). Thus, if system probabilities change abruptly, the system can be viewed as restarting with a different initial condition. Thus, one expects the system to react robustly to such changes.

To illustrate this, consider the same 3-user system of the previous subsection, using $V = 50$, $W = 40$. The event processes $\omega_i(t)$ have the same probabilities as given in the previous subsection for slots $t < 4000$ and $t > 8000$. Call this distribution type 1. However, for slots $t \in \{4000, \ldots, 8000\}$, the $\omega_i(t)$ processes are independently chosen with a different distribution as follows:

• $\Pr[\omega_1(t) = 0] = \Pr[\omega_1(t) = 9] = 1/2$.

• $\Pr[\omega_2(t) = k] = 1/4$ for $k \in \{6, 7, 8, 9\}$.

• $\Pr[\omega_3(t) = k] = 1/4$ for $k \in \{6, 7, 8, 9\}$.

This is called distribution type 2.

Fig. 5 shows average utility and average power over the first 12000 slots. Values at each slot $t$ are averaged over 2000 independent system runs. The two dashed horizontal lines in the top plot of the figure are long term time average utilities achieved over 10^ slots under probabilities that are fixed at distribution type 1 and type 2, respectively. It is seen that the system adapts to the non-ergodic change by quickly adjusting to the new optimal average utility. The figure also plots average power of user 1 versus time, with a dashed horizontal line at the power constraint 1/3. A noticeable disturbance in average power occurs at the non-ergodic changes in distribution.

It was observed that system performance is not very sensitive to inaccurate estimates of the $r_k^{(m)}$ values (results not shown in the figures). This suggests that, for this example, the virtual queues alone are sufficient to ensure the average power constraints are met, which, together with loose estimates for $r_k^{(m)}$, are sufficient to provide an accurate approximation to optimality.



VII. Conclusions

This paper treated distributed scheduling in a multi-user 
system where users know their own observations and actions, 
but not those of others. In this context, there is a fundamental performance gap between distributed and centralized decisions. Optimal distributed policies were constructed by
correlating decisions via a source of common randomness. 
The optimal policy is computable via a linear program if 
all system probabilities are known, and through an online 
algorithm with virtual queues if probabilities are unknown. 
The online algorithm assumes there is delayed feedback about 
previous penalties and rewards. The algorithm was shown 
in simulation to adapt when system probabilities change. In 
the special case when the events observed at each user are 
independent and when penalty and utility functions satisfy a 
preferred action property, the number of pure strategies for 
consideration on each slot can be significantly reduced. In 
some cases, this reduces an exponentially complex algorithm 
to one that has only polynomial complexity. 

Appendix A — Proof of Theorem 1

This appendix proves Theorem 1. Define the $(K + 1)$-dimensional penalty vectors:

$$p(t) = (p_0(t), p_1(t), \ldots, p_K(t))$$
$$p(\alpha, \omega) = (p_0(\alpha, \omega), p_1(\alpha, \omega), \ldots, p_K(\alpha, \omega))$$

For each $m \in \{1, \ldots, M\}$, define:

$$r^{(m)} \triangleq \sum_{\omega \in \Omega} \pi(\omega) p(g^{(m)}(\omega), \omega) = (r_0^{(m)}, r_1^{(m)}, \ldots, r_K^{(m)})$$

Define $\mathcal{R}$ as the convex hull of these vectors:

$$\mathcal{R} \triangleq Conv(\{r^{(1)}, \ldots, r^{(M)}\})$$

The set $\mathcal{R}$ is convex, closed, and bounded. From the nature of the convex hull operation, the set $\mathcal{R}$ can be viewed as the set of all average penalty vectors achievable by timesharing over the $M$ different pure strategies.

Lemma 7: Let $\alpha(t)$ be decisions of an algorithm that satisfies the distributed scheduling constraint (11) on every slot. Then:

(a) For all slots $t \in \{0, 1, 2, \ldots\}$:

$$\mathbb{E}[p(t)] \in \mathcal{R}$$

(b) For all slots $t \in \{1, 2, 3, \ldots\}$:

$$\overline{p}(t) \in \mathcal{R}$$

where

$$\overline{p}(t) \triangleq \frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}[p(\tau)]$$

Proof: Part (b) follows immediately from part (a) together with the fact that $\mathcal{R}$ is convex. To prove part (a), fix a slot $t \in \{0, 1, 2, \ldots\}$. By (11), the users make decisions:

$$\alpha(t) = (f_1(\omega_1(t), X(t)), \ldots, f_N(\omega_N(t), X(t)))$$

For each $X(t) \in \mathcal{X}$ and $\omega \in \Omega$, define:

$$g_{X(t)}(\omega) = (f_1(\omega_1, X(t)), \ldots, f_N(\omega_N, X(t)))$$

Then, given $X(t)$, the function $g_{X(t)}(\omega)$ is a pure strategy. Hence, $g_{X(t)}(\omega) = g^{(m)}(\omega)$ for some $m \in \{1, \ldots, M\}$. Define $m_{X(t)}$ as the value $m \in \{1, \ldots, M\}$ for which this holds. Thus, $g_{X(t)}(\omega) = g^{(m_{X(t)})}(\omega)$, and:

$$\mathbb{E}[p(t) \mid X(t)] = \mathbb{E}[p(\alpha(t), \omega(t)) \mid X(t)] = \mathbb{E}[p(g^{(m_{X(t)})}(\omega(t)), \omega(t)) \mid X(t)] = r^{(m_{X(t)})}$$

Taking expectations of both sides and using the law of iterated expectations gives:

$$\mathbb{E}[p(t)] = \sum_{m=1}^{M} \Pr[m_{X(t)} = m]\, r^{(m)}$$

The above is a convex combination of $\{r^{(1)}, \ldots, r^{(M)}\}$, and hence is in $\mathcal{R}$. ■

Lemma 8: There exist real numbers $r_1, r_2, \ldots, r_K$ that satisfy the following:

$$r_k \leq c_k \quad \forall k \in \{1, \ldots, K\} \tag{35}$$
$$(p_0^{opt}, r_1, r_2, \ldots, r_K) \in \mathcal{R} \tag{36}$$

Furthermore, the vector in (36) is on the boundary of $\mathcal{R}$.

Proof: Fix $q$ as a positive integer. Consider an algorithm that satisfies the distributed scheduling constraint (11) every slot. For $k \in \{0, 1, \ldots, K\}$, let $\overline{p}_k(t)$ be the resulting time average expected penalties. Assume the algorithm satisfies:

$$p_0^{opt} \leq \limsup_{t\to\infty} \overline{p}_0(t) \leq p_0^{opt} + 1/q \tag{37}$$
$$\limsup_{t\to\infty} \overline{p}_k(t) \leq c_k \quad \forall k \in \{1, \ldots, K\} \tag{38}$$

Such an algorithm must exist because $p_0^{opt}$ is the infimum objective value for (12) over all algorithms that satisfy the constraints (13)-(14).

Lemma 7 implies that $\overline{p}(t) = (\overline{p}_0(t), \ldots, \overline{p}_K(t)) \in \mathcal{R}$ for all $t > 0$. Let $t_n$ be a subsequence of times over which $\overline{p}_0(t)$ achieves its limsup. Since $\overline{p}(t_n)$ is in the closed and bounded set $\mathcal{R}$ for all $t_n > 0$, the Bolzano-Weierstrass theorem implies there is a subsequence $\overline{p}(t_{n_m})$ that converges to a point $r(q) \in \mathcal{R}$, where $r(q) = (r_0(q), \ldots, r_K(q))$. Thus:

$$r_0(q) = \lim_{m\to\infty} \overline{p}_0(t_{n_m}) = \limsup_{t\to\infty} \overline{p}_0(t) \tag{39}$$
$$r_k(q) = \lim_{m\to\infty} \overline{p}_k(t_{n_m}) \leq \limsup_{t\to\infty} \overline{p}_k(t) \quad \forall k \in \{1, \ldots, K\}$$

Using (38) in the last inequality above gives:

$$r_k(q) \leq c_k \quad \forall k \in \{1, \ldots, K\} \tag{40}$$

Further, substituting (39) into (37) gives:

$$p_0^{opt} \leq r_0(q) \leq p_0^{opt} + 1/q \tag{41}$$

This holds for all positive integers $q$. Thus, $\{r(q)\}_{q=1}^{\infty}$ is an infinite sequence of vectors in $\mathcal{R}$ such that $r(q)$ satisfies (40) and (41) for all $q \in \{1, 2, 3, \ldots\}$. Because $\mathcal{R}$ is closed and bounded, the sequence $\{r(q)\}_{q=1}^{\infty}$ has a limit point $r = (r_0, r_1, \ldots, r_K) \in \mathcal{R}$ that satisfies $r_0 = p_0^{opt}$ and $r_k \leq c_k$ for all $k \in \{1, \ldots, K\}$. This proves (35) and (36).

To prove that $r$ is on the boundary of $\mathcal{R}$, it suffices to note that for any $\epsilon > 0$:

$$(p_0^{opt} - \epsilon, r_1, \ldots, r_K) \notin \mathcal{R}$$

Indeed, if this were not true, it would be possible to construct a distributed algorithm that satisfies all desired constraints and yields a time average expected value of $p_0(t)$ equal to $p_0^{opt} - \epsilon$, which contradicts the definition of $p_0^{opt}$. ■

Because $\mathcal{R} = Conv(\{r^{(1)}, \ldots, r^{(M)}\})$, Lemma 8 implies there are probabilities $\theta_m$ that sum to 1 such that:

$$(p_0^{opt}, r_1, \ldots, r_K) = \sum_{m=1}^{M} \theta_m r^{(m)}$$

Because $\mathcal{R}$ is a $(K + 1)$-dimensional set, Carathéodory's theorem ensures the above can be written using at most $K + 2$ non-zero $\theta_m$ values. However, because the above vector is on the boundary of $\mathcal{R}$, a simple extension of Carathéodory's theorem ensures it can be written using at most $K + 1$ non-zero $\theta_m$ values (see footnote 1). This proves Theorem 1.

Footnote 1: This extension to points on the boundary of a convex hull can be proven using Carathéodory's theorem together with the supporting hyperplane theorem for convex sets [21].

Appendix B — A counterexample

This appendix shows it is possible for an algorithm to satisfy the conditional independence assumption (10) while yielding expected utility strictly larger than that of any distributed algorithm. Consider a two user system with $\omega_1(t), \omega_2(t)$ independent and i.i.d. Bernoulli processes with:

$$\Pr[\omega_i(t) = 1] = \Pr[\omega_i(t) = 0] = 1/2 \quad \forall i \in \{1, 2\}$$

The actions are constrained to:

$$\alpha_1(t) \in \{-1, 1\}, \quad \alpha_2(t) \in \{-1, 1\}$$

Define the utility function:

$$u(\alpha_1, \alpha_2, \omega_1, \omega_2) = g(\omega_1, \omega_2)\alpha_1\alpha_2$$

where $g(\omega_1, \omega_2) = 1 - 2\omega_1\omega_2$. Then $u(\cdot) \in \{-1, 1\}$. Fig. 6 indicates when the utility is 1.

omega_1   omega_2   g(omega_1, omega_2)   Conditions required for u = 1
0         0         1                     alpha_1 = alpha_2
0         1         1                     alpha_1 = alpha_2
1         0         1                     alpha_1 = alpha_2
1         1         -1                    alpha_1 != alpha_2

Fig. 6. A table showing the conditions needed for $u(\alpha_1, \alpha_2, \omega_1, \omega_2) = 1$.

Consider now the following centralized algorithm: Every slot $t$, observe $(\omega_1(t), \omega_2(t))$ and compute $g(\omega_1(t), \omega_2(t))$.

• If $g(\omega_1(t), \omega_2(t)) = 1$, independently choose:

$$(\alpha_1(t), \alpha_2(t)) = \begin{cases} (1, 1) & \text{with probability } 1/2 \\ (-1, -1) & \text{with probability } 1/2 \end{cases}$$

• If $g(\omega_1(t), \omega_2(t)) = -1$, independently choose:

$$(\alpha_1(t), \alpha_2(t)) = \begin{cases} (1, -1) & \text{with probability } 1/2 \\ (-1, 1) & \text{with probability } 1/2 \end{cases}$$

The randomization ensures that regardless of $(\omega_1(t), \omega_2(t))$:

$$\Pr[\alpha_1(t) = 1 \mid \omega_1(t), \omega_2(t)] = 1/2$$
$$\Pr[\alpha_2(t) = 1 \mid \omega_1(t), \omega_2(t)] = 1/2$$

and hence the conditional independence assumption (10) is satisfied. This algorithm guarantees the utility function is 1 for all possible outcomes, and so the expected utility is also 1. However, it can be shown that an optimal distributed algorithm is the pure strategy $\alpha_1(t) = \alpha_2(t) = 1$ for all $t$ (regardless of $\omega_1(t), \omega_2(t)$), which yields an expected utility of only 1/2.
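The gap between 1 and 1/2 can be checked by enumeration. The sketch below is illustrative and its function names are not from the paper. It relies on the fact, from Appendix A, that the expected penalties of any distributed algorithm are convex combinations of those of pure strategies, so the best distributed utility is attained by one of the 16 pure strategy pairs.

```python
from fractions import Fraction
from itertools import product

def g(w1, w2):
    """Sign function of the counterexample: 1 - 2 * w1 * w2."""
    return 1 - 2 * w1 * w2

def utility(a1, a2, w1, w2):
    return g(w1, w2) * a1 * a2

def best_distributed_utility():
    """Maximum expected utility over all pure distributed strategies,
    where f1[w1] and f2[w2] are the actions of users 1 and 2."""
    best = None
    for f1 in product((-1, 1), repeat=2):
        for f2 in product((-1, 1), repeat=2):
            value = sum(Fraction(utility(f1[w1], f2[w2], w1, w2), 4)
                        for w1 in (0, 1) for w2 in (0, 1))
            best = value if best is None else max(best, value)
    return best

def centralized_utility():
    """Expected utility of the centralized randomized rule above."""
    total = Fraction(0)
    for w1, w2 in product((0, 1), repeat=2):
        pairs = [(1, 1), (-1, -1)] if g(w1, w2) == 1 else [(1, -1), (-1, 1)]
        total += Fraction(1, 4) * sum(
            Fraction(utility(a1, a2, w1, w2), 2) for a1, a2 in pairs)
    return total
```

Exact rational arithmetic is used so that the two values can be compared without rounding concerns.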

Appendix C — Preferred Action Lemmas 

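This appendix provides proofs of Lemmas 1-4. The proofs of Lemmas 1 and 2 follow from the following lemma.

Lemma 9: A penalty function $p(\alpha, \omega)$ has the preferred action property if it satisfies the following three properties:

• $\mathcal{A}_i = \{0, 1\}$ for $i \in \{1, \ldots, N\}$.

• $p(\alpha, \omega)$ is non-increasing in the vector $\omega$. That is, for all $\alpha \in \mathcal{A}$ and all vectors $\omega, \gamma \in \Omega$ that satisfy $\omega \leq \gamma$ (with inequality taken entrywise), one has:

$$p(\alpha, \omega) \geq p(\alpha, \gamma)$$

• Given $\alpha_i = 0$, $p(\alpha, \omega)$ does not depend on $\omega_i$. That is, for all $i \in \{1, \ldots, N\}$, all possible values of $\alpha_{\bar{i}} \in \mathcal{A}_{\bar{i}}$, $\omega_{\bar{i}} \in \Omega_{\bar{i}}$, and all $\omega, \gamma \in \Omega_i$, one has:

$$p([\alpha_{\bar{i}}, 0], [\omega_{\bar{i}}, \omega]) = p([\alpha_{\bar{i}}, 0], [\omega_{\bar{i}}, \gamma])$$

Proof: Fix $i \in \{1, \ldots, N\}$, fix $\alpha_{\bar{i}}$, $\omega_{\bar{i}}$, and fix $\alpha, \beta \in \{0, 1\}$, $\omega, \gamma \in \Omega_i$ that satisfy $\alpha > \beta$ and $\omega < \gamma$. Since $\alpha, \beta$ are binary numbers that satisfy $\alpha > \beta$, it must be that $\alpha = 1$, $\beta = 0$. The goal is to show:

$$p([\alpha_{\bar{i}}, 1], [\omega_{\bar{i}}, \omega]) - p([\alpha_{\bar{i}}, 0], [\omega_{\bar{i}}, \omega]) \geq p([\alpha_{\bar{i}}, 1], [\omega_{\bar{i}}, \gamma]) - p([\alpha_{\bar{i}}, 0], [\omega_{\bar{i}}, \gamma])$$

Since the second term on the left-hand-side is the same as the second term on the right-hand-side, it suffices to show:

$$p([\alpha_{\bar{i}}, 1], [\omega_{\bar{i}}, \omega]) \geq p([\alpha_{\bar{i}}, 1], [\omega_{\bar{i}}, \gamma])$$

The above inequality is true because $\omega < \gamma$ and $p(\alpha, \omega)$ is non-increasing in the vector $\omega$. ■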

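Proof: (Lemma 1) Suppose:

$$p(\alpha, \omega) = -\min\left[\sum_{i=1}^{N} \phi_i(\omega_i)\alpha_i,\; b\right]$$

where $\mathcal{A}_i = \{0, 1\}$ for $i \in \{1, \ldots, N\}$, $b$ is a real number, and all functions $\phi_i(\omega_i)$ are non-decreasing in $\omega_i$. Then $p(\alpha, \omega)$ is non-increasing in the $\omega$ vector. Furthermore, for any given $i \in \{1, \ldots, N\}$, any $\alpha_{\bar{i}} \in \mathcal{A}_{\bar{i}}$, $\omega_{\bar{i}} \in \Omega_{\bar{i}}$, and any $\omega, \gamma \in \Omega_i$, one has:

$$p([\alpha_{\bar{i}}, 0], [\omega_{\bar{i}}, \omega]) = -\min\left[\sum_{j \neq i} \phi_j(\omega_j)\alpha_j,\; b\right] = p([\alpha_{\bar{i}}, 0], [\omega_{\bar{i}}, \gamma])$$

Thus, $p(\alpha, \omega)$ satisfies the requirements of Lemma 9. ■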

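Proof: (Lemma 2) Suppose:

$$p(\alpha, \omega) = -\sum_{i=1}^{N} \omega_i\alpha_i \prod_{j \neq i} (1 - \alpha_j)$$

where $\alpha_i \in \{0, 1\}$ and $\omega_i \in \{0, 1, \ldots, |\Omega_i| - 1\}$ for all $i \in \{1, \ldots, N\}$. Then $p(\alpha, \omega)$ is non-increasing in the $\omega$ vector. Now fix $i \in \{1, \ldots, N\}$, fix $\alpha_{\bar{i}}$, $\omega_{\bar{i}}$, and fix $\omega, \gamma \in \Omega_i$. Then:

$$p([\alpha_{\bar{i}}, 0], [\omega_{\bar{i}}, \omega]) = -\sum_{k \neq i} \omega_k\alpha_k \prod_{j \neq k} (1 - \alpha_j) = p([\alpha_{\bar{i}}, 0], [\omega_{\bar{i}}, \gamma])$$

Thus, $p(\alpha, \omega)$ satisfies the requirements of Lemma 9. ■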

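Proof: (Lemma 3) Suppose:

$$p(\alpha, \omega) = \prod_{i=1}^{N} \phi_i(\omega_i)\psi_i(\alpha_i)$$

where $\phi_i(\omega_i)$ is non-negative and non-increasing in $\omega_i$ and $\psi_i(\alpha_i)$ is non-negative and non-decreasing in $\alpha_i$. Fix $i \in \{1, \ldots, N\}$, fix $\alpha_{\bar{i}}$, $\omega_{\bar{i}}$, and fix $\alpha, \beta \in \mathcal{A}_i$, $\omega, \gamma \in \Omega_i$ that satisfy $\alpha > \beta$ and $\omega < \gamma$. The goal is to show:

$$p([\alpha_{\bar{i}}, \alpha], [\omega_{\bar{i}}, \omega]) - p([\alpha_{\bar{i}}, \beta], [\omega_{\bar{i}}, \omega]) \geq p([\alpha_{\bar{i}}, \alpha], [\omega_{\bar{i}}, \gamma]) - p([\alpha_{\bar{i}}, \beta], [\omega_{\bar{i}}, \gamma])$$

By canceling common (non-negative) factors, it suffices to show:

$$\phi_i(\omega)\psi_i(\alpha) - \phi_i(\omega)\psi_i(\beta) \geq \phi_i(\gamma)\psi_i(\alpha) - \phi_i(\gamma)\psi_i(\beta)$$

This is equivalent to:

$$\phi_i(\omega)(\psi_i(\alpha) - \psi_i(\beta)) \geq \phi_i(\gamma)(\psi_i(\alpha) - \psi_i(\beta)) \tag{42}$$

Since $\alpha > \beta$ and $\psi_i(\alpha)$ is non-decreasing, one has $\psi_i(\alpha) - \psi_i(\beta) \geq 0$. By canceling the common (non-negative) factor, it suffices to show:

$$\phi_i(\omega) \geq \phi_i(\gamma)$$

This is true because $\omega < \gamma$ and $\phi_i(\omega)$ is non-increasing. ■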
Proof: (Lemma 4) Suppose:

$$p(\alpha, \omega) = \sum_{r=1}^{R} w_r p_r(\alpha, \omega)$$

where $w_r$ are non-negative constants, and each function $p_r(\alpha, \omega)$ has the preferred action property. Fix $i \in \{1, \ldots, N\}$, fix $\alpha_{\bar{i}}$, $\omega_{\bar{i}}$, and fix $\alpha, \beta \in \mathcal{A}_i$, $\omega, \gamma \in \Omega_i$ that satisfy $\alpha > \beta$ and $\omega < \gamma$. Since each function $p_r(\alpha, \omega)$ has the preferred action property, one has for all $r \in \{1, \ldots, R\}$:

$$p_r([\alpha_{\bar{i}}, \alpha], [\omega_{\bar{i}}, \omega]) - p_r([\alpha_{\bar{i}}, \beta], [\omega_{\bar{i}}, \omega]) \geq p_r([\alpha_{\bar{i}}, \alpha], [\omega_{\bar{i}}, \gamma]) - p_r([\alpha_{\bar{i}}, \beta], [\omega_{\bar{i}}, \gamma])$$

Multiplying the above inequality by $w_r$ and summing over $r \in \{1, \ldots, R\}$ proves that $p(\alpha, \omega)$ has the preferred action property. ■
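For small finite action and event sets, the inequality used throughout this appendix can be verified exhaustively. The sketch below is illustrative: the checker and its argument layout are assumptions, the first test function has the form treated in the proof of Lemma 2, and the second is a hypothetical function that is increasing in $\omega$ and therefore fails the property.

```python
import math
from itertools import product

def has_preferred_action_property(p, action_sets, event_sets):
    """Check, for every user i and all alpha > beta, omega < gamma, that
    p([., alpha], [., omega]) - p([., beta], [., omega])
      >= p([., alpha], [., gamma]) - p([., beta], [., gamma])."""
    n = len(action_sets)
    for i in range(n):
        other_actions = [action_sets[j] for j in range(n) if j != i]
        other_events = [event_sets[j] for j in range(n) if j != i]
        for a_rest in product(*other_actions):
            for w_rest in product(*other_events):
                def at(a_i, w_i):
                    # Re-insert user i's entries at position i.
                    a = a_rest[:i] + (a_i,) + a_rest[i:]
                    w = w_rest[:i] + (w_i,) + w_rest[i:]
                    return p(a, w)
                for alpha, beta in product(action_sets[i], repeat=2):
                    if alpha <= beta:
                        continue
                    for omega, gamma in product(event_sets[i], repeat=2):
                        if omega >= gamma:
                            continue
                        lhs = at(alpha, omega) - at(beta, omega)
                        rhs = at(alpha, gamma) - at(beta, gamma)
                        if lhs < rhs:
                            return False
    return True

def lemma2_penalty(a, w):
    """Penalty of the form -sum_i w_i a_i prod_{j != i} (1 - a_j)."""
    n = len(a)
    return -sum(w[i] * a[i] * math.prod(1 - a[j] for j in range(n) if j != i)
                for i in range(n))

def increasing_penalty(a, w):
    """Hypothetical penalty that grows with the events (violates the property)."""
    return sum(wi * ai for wi, ai in zip(w, a))

actions = [(0, 1)] * 3
events = [(0, 1, 2)] * 3
```

Integer-valued penalties are used so that the comparison is exact.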

VIII. Appendix D — The Slater condition 

For a given real number $\epsilon \geq 0$, consider the following linear program that is related to the linear program (16)-(19):

$$\text{Minimize:} \quad \sum_{m=1}^{M} \theta_m r_0^{(m)} \tag{43}$$
$$\text{Subject to:} \quad \sum_{m=1}^{M} \theta_m r_k^{(m)} \leq c_k - \epsilon \quad \forall k \in \{1, \ldots, K\} \tag{44}$$
$$\theta_m \geq 0 \quad \forall m \in \{1, \ldots, M\} \tag{45}$$
$$\sum_{m=1}^{M} \theta_m = 1 \tag{46}$$

If $\epsilon > 0$, the penalty constraints are tighter above than in the linear program (16)-(19) (compare (44) and (17)). Define $G(\epsilon)$ as the optimal objective value (43) as a function of the parameter $\epsilon$. Then $G(0) = p_0^{opt}$, where $p_0^{opt}$ corresponds to the original linear program (16)-(19). Define $\epsilon_{max}$ as the largest value of $\epsilon$ for which (43)-(46) is feasible. Suppose $\epsilon_{max} > 0$. This means it is possible to satisfy the desired time average penalty constraints with a slackness of $\epsilon_{max}$ in each constraint $k \in \{1, \ldots, K\}$. The condition $\epsilon_{max} > 0$ is called the Slater condition [22].

For simplicity of exposition, assume $D = 0$. Since the drift-plus-penalty algorithm takes actions that minimize the right-hand-side of (26) over all probability mass functions $\beta_m(t)$, one has:

$$\mathbb{E}[\Delta(t) + V p_0(t) \mid Q(t)] \leq B + V\sum_{m=1}^{M} \theta_m r_0^{(m)} + \sum_{k=1}^{K} Q_k(t)\left[\sum_{m=1}^{M} \theta_m r_k^{(m)} - c_k\right]$$

for any values $\theta_m$ that satisfy (45)-(46). Using $\theta_m$ values that solve (43)-(46) for the case $\epsilon = \epsilon_{max}$ gives:

$$\mathbb{E}[\Delta(t) + V p_0(t) \mid Q(t)] \leq B + V G(\epsilon_{max}) - \epsilon_{max}\sum_{k=1}^{K} Q_k(t)$$

Therefore, for all slots $t \in \{0, 1, 2, \ldots\}$ one has:

$$\mathbb{E}[\Delta(t) \mid Q(t)] \leq B + FV - \epsilon_{max}\sum_{k=1}^{K} Q_k(t) \tag{47}$$

where $F$ is a constant that satisfies the following for all slots $t$ and all possible values of $Q(t)$:

$$F \geq G(\epsilon_{max}) - \mathbb{E}[p_0(t) \mid Q(t)]$$

Now define $\delta_{max}$ as the largest possible change in $\|Q(t)\|$ from one slot to the next, so that regardless of the control decisions, one has:

$$\big|\,\|Q(t + 1)\| - \|Q(t)\|\,\big| \leq \delta_{max} \quad \forall t \in \{0, 1, 2, \ldots\} \tag{48}$$

Such a value $\delta_{max}$ exists because all penalty functions $p_k(\alpha(t), \omega(t))$ are bounded.

Lemma 10: Let $\delta_{max}$ be a positive value that satisfies (48). Let $A$ be a non-negative real number, and let $\epsilon > 0$. Assume $\|Q(0)\| = 0$ with probability 1, and that for all slots $t$ and all possible $Q(t)$ one has:

$$\mathbb{E}[\Delta(t) \mid Q(t)] \leq A - \epsilon\sum_{k=1}^{K} Q_k(t) \tag{49}$$

Then for all slots $t \in \{1, 2, \ldots\}$:

$$\mathbb{E}[\|Q(t)\|] \leq \max\left[\frac{\log(2)}{r},\; \max\left[\frac{2A}{\epsilon}, \frac{\epsilon}{2}\right] + \frac{1}{r}\log\left(2t\left[e^{r\delta_{max}} - 1\right]\right)\right]$$

where $r$ is defined:

$$r \triangleq \frac{\epsilon}{\delta_{max}^2 + \epsilon\delta_{max}/3} \tag{50}$$

Using $A = B + FV$ in (47) shows that the system under study satisfies the requirements of the above lemma, which proves that (31) holds. The proof of the above lemma relies heavily on drift analysis in [23] and results for exponentiated martingales in [24].
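The bound of Lemma 10 is easy to evaluate numerically, which makes its logarithmic growth in $t$ visible. The following is a minimal sketch with hypothetical parameter values; the function names are illustrative.

```python
import math

def lemma10_rate(eps, delta_max):
    """The constant r of (50)."""
    return eps / (delta_max ** 2 + eps * delta_max / 3.0)

def lemma10_bound(t, A, eps, delta_max):
    """Right-hand side of the Lemma 10 bound on E[||Q(t)||] for t >= 1."""
    r = lemma10_rate(eps, delta_max)
    C = max(2.0 * A / eps, eps / 2.0)
    growth = math.log(2.0 * t * (math.exp(r * delta_max) - 1.0)) / r
    return max(math.log(2.0) / r, C + growth)

# Hypothetical parameters: A = 10, eps = 0.5, delta_max = 1.
r_example = lemma10_rate(0.5, 1.0)
bound_at_1e3 = lemma10_bound(10 ** 3, 10.0, 0.5, 1.0)
bound_at_1e6 = lemma10_bound(10 ** 6, 10.0, 0.5, 1.0)
```

In this example, multiplying $t$ by 1000 raises the bound by $\log(1000)/r$, so the expected queue norm divided by $t$ decays like $\log(t)/t$ once the constant term is accounted for.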

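Proof: (Lemma 10) Suppose that:

$$\|Q(t)\| \geq \max[2A/\epsilon, \epsilon/2] \tag{51}$$

By definition of $\Delta(t)$, one has from (49):

$$\mathbb{E}[\|Q(t + 1)\|^2 \mid Q(t)] \leq \|Q(t)\|^2 + 2A - 2\epsilon\sum_{k=1}^{K} Q_k(t)$$
$$\leq \|Q(t)\|^2 + 2A - 2\epsilon\|Q(t)\| \tag{52}$$
$$\leq \|Q(t)\|^2 - \epsilon\|Q(t)\| \tag{53}$$
$$\leq (\|Q(t)\| - \epsilon/2)^2$$

where (52) holds because the sum of the components of a non-negative vector is greater than or equal to its norm, and (53) holds because (51) implies $\epsilon\|Q(t)\| \geq 2A$. By Jensen's inequality:

$$\mathbb{E}[\|Q(t + 1)\| \mid Q(t)]^2 \leq (\|Q(t)\| - \epsilon/2)^2$$

Taking the square root of both sides and using (51) gives:

$$\mathbb{E}[\|Q(t + 1)\| \mid Q(t)] \leq \|Q(t)\| - \epsilon/2 \tag{54}$$

Define $C$ by:

$$C \triangleq \max[2A/\epsilon, \epsilon/2]$$

so that (54) holds whenever $\|Q(t)\| \geq C$. Define $\delta(t)$ by:

$$\delta(t) \triangleq \|Q(t + 1)\| - \|Q(t)\|$$

and note that $|\delta(t)| \leq \delta_{max}$ for all $t$. It follows that:

$$\mathbb{E}[\delta(t) \mid Q(t)] \leq \begin{cases} -\epsilon/2 & \text{if } \|Q(t)\| \geq C \\ \delta_{max} & \text{otherwise} \end{cases} \tag{55}$$

Define $Y(t) = e^{r\|Q(t)\|}$ for a positive value of $r$ to be determined. Assume that $r$ satisfies:

$$0 < r\delta_{max} < 3 \tag{56}$$

Then:

$$Y(t + 1) - Y(t) = e^{r\|Q(t)\|}e^{r\delta(t)} - Y(t) = Y(t)\left[e^{r\delta(t)} - 1\right]$$

Thus:

$$Y(t + 1) - Y(t) \leq \begin{cases} Y(t)\left[e^{r\delta(t)} - 1\right] & \text{if } \|Q(t)\| \geq C \\ e^{rC}\left[e^{r\delta_{max}} - 1\right] & \text{otherwise} \end{cases}$$

Now define $g(x)$ as the function that satisfies the following for all real numbers $x$:

$$e^x - 1 = x + \frac{x^2}{2}g(x) \tag{57}$$

By results in [24], the function $g(x)$ is non-decreasing in $x$ and satisfies:

$$g(x) \leq \frac{1}{1 - x/3} \quad \forall x \in [0, 3) \tag{58}$$

It follows from (57) that:

$$e^{r\delta(t)} - 1 = r\delta(t) + \frac{(r\delta(t))^2}{2}g(r\delta(t))$$
$$\leq r\delta(t) + \frac{(r\delta_{max})^2}{2}g(r\delta_{max})$$
$$\leq r\delta(t) + \frac{(r\delta_{max})^2}{2(1 - r\delta_{max}/3)}$$

where the final inequality uses (58), which is justified because $r\delta_{max}$ satisfies (56). Thus:

$$Y(t + 1) - Y(t) \leq \begin{cases} Y(t)\left[r\delta(t) + \frac{(r\delta_{max})^2}{2(1 - r\delta_{max}/3)}\right] & \text{if } \|Q(t)\| \geq C \\ e^{rC}\left[e^{r\delta_{max}} - 1\right] & \text{otherwise} \end{cases}$$

Taking expectations and using (55) gives:

$$\mathbb{E}[Y(t + 1) - Y(t) \mid Q(t)] \leq \begin{cases} Y(t)\left[-\frac{r\epsilon}{2} + \frac{(r\delta_{max})^2}{2(1 - r\delta_{max}/3)}\right] & \text{if } \|Q(t)\| \geq C \\ e^{rC}\left[e^{r\delta_{max}} - 1\right] & \text{otherwise} \end{cases}$$

Now choose $r$ so that:

$$-\frac{r\epsilon}{2} + \frac{(r\delta_{max})^2}{2(1 - r\delta_{max}/3)} = 0$$

This holds for $r$ as defined in (50), and this choice of $r$ maintains the inequality (56). Thus:

$$\mathbb{E}[Y(t + 1) - Y(t) \mid Q(t)] \leq \begin{cases} 0 & \text{if } \|Q(t)\| \geq C \\ e^{rC}\left[e^{r\delta_{max}} - 1\right] & \text{otherwise} \end{cases}$$

Therefore, for all slots $t$:

$$\mathbb{E}[Y(t + 1) - Y(t)] \leq e^{rC}\left[e^{r\delta_{max}} - 1\right]$$

Summing the above over $\tau \in \{0, 1, \ldots, t - 1\}$ for some integer $t > 0$ gives:

$$\mathbb{E}[Y(t)] - \mathbb{E}[Y(0)] \leq e^{rC}\left[e^{r\delta_{max}} - 1\right]t$$

Since $Y(0) = 1$ with probability 1, and $Y(t) = e^{r\|Q(t)\|}$, one has:

$$\mathbb{E}\left[e^{r\|Q(t)\|}\right] - 1 \leq e^{rC}\left[e^{r\delta_{max}} - 1\right]t$$

By Jensen's inequality for the convex function $e^x$ one has:

$$e^{r\mathbb{E}[\|Q(t)\|]} - 1 \leq e^{rC}\left[e^{r\delta_{max}} - 1\right]t$$

Thus:

$$r\,\mathbb{E}[\|Q(t)\|] \leq \log\left(1 + e^{rC}\left[e^{r\delta_{max}} - 1\right]t\right)$$
$$\leq \max\left[\log(2), \log\left(2e^{rC}\left[e^{r\delta_{max}} - 1\right]t\right)\right]$$
$$\leq \max\left[\log(2), rC + \log\left(2t\left[e^{r\delta_{max}} - 1\right]\right)\right]$$

Dividing the above by $r$ gives the following, which holds for all integers $t > 0$:

$$\mathbb{E}[\|Q(t)\|] \leq \max\left[\frac{\log(2)}{r},\; C + \frac{\log\left(2t\left[e^{r\delta_{max}} - 1\right]\right)}{r}\right]$$

■

Appendix E — The constant in Theorem 3

This appendix proves the inequality involving the $2BD$ constant at the end of the proof of Theorem 3. From (25) one has for all queues $k \in \{1, 2, \ldots, K\}$ and all slots $\tau$:

$$|Q_k(\tau + 1) - Q_k(\tau)| \leq |p_k(\tau - D) - c_k|$$

Thus, for all slots $t$:

$$|Q_k(t + D) - Q_k(t)| \leq \sum_{d=1}^{D} |Q_k(t + d) - Q_k(t + d - 1)|$$
$$\leq \sum_{d=1}^{D} |p_k(t + d - 1 - D) - c_k|$$
$$= \sum_{d=1}^{D} |p_k(t_d) - c_k|$$

where for notational simplicity $t_d$ has been defined:

$$t_d \triangleq t + d - 1 - D$$

Thus:

$$\sum_{k=1}^{K} (Q_k(t + D) - Q_k(t))(p_k(t) - c_k) \leq \sum_{k=1}^{K}\sum_{d=1}^{D} |p_k(t_d) - c_k||p_k(t) - c_k|$$

Taking expectations of the above and using the Cauchy-Schwarz inequality:

$$\mathbb{E}\left[\sum_{k=1}^{K} (Q_k(t + D) - Q_k(t))(p_k(t) - c_k)\right] \leq \sum_{k=1}^{K}\sum_{d=1}^{D} \sqrt{\mathbb{E}[|p_k(t_d) - c_k|^2]}\sqrt{\mathbb{E}[|p_k(t) - c_k|^2]}$$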



K 



1 \ k=l 



Y,n\pk(u) - ckV 



K 



Y.n\Pk{t)-cuV] 
\ fc=i 



has: 



E 



,r\\Q{t) 



- 1 < e^'^fe'''^" 



Hi 



'Strictly speaking, these expectations should be conditioned on Q{t) to 
match with the inequalities at the end of Theorem[3] That explicit conditioning 
has been suppressed to simplify the expressions. 
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where the final inequality follows because the inner product 
of two vectors is less than or equal to the product of norms. 
The right hand side is less than or equal to: 



[4 
[5 

[6: 

[7 
[8 

[9: 

[lo: 
[11 

[12 
[13 

[14: 

[15 

[16: 

[17 

[18 
[19 

[2o: 

[21 

[22: 

[23 
[24: 



2BD 



d=i 
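The two Cauchy-Schwarz steps of this appendix can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: expectations are replaced by sample means, the penalties are drawn as independent uniform values, and the quantity standing in for $2B$ is taken to be the largest sample value of $\sum_k \mathbb{E}[|p_k - c_k|^2]$ over the slots involved (an assumption about how $B$ bounds the second moments). The function name and all numerical values are hypothetical.

```python
import math
import random


def appendix_e_check(K=3, D=4, samples=2000, seed=0):
    """Return (lhs, middle, rhs, two_B_times_D) using sample means as expectations."""
    rng = random.Random(seed)
    c = [rng.uniform(-1.0, 1.0) for _ in range(K)]  # hypothetical levels c_k

    # a[s][d][k] plays the role of |p_k(t_d) - c_k|; b[s][k] of |p_k(t) - c_k|.
    a = [[[abs(rng.uniform(-2.0, 2.0) - c[k]) for k in range(K)]
          for _ in range(D)] for _ in range(samples)]
    b = [[abs(rng.uniform(-2.0, 2.0) - c[k]) for k in range(K)]
         for _ in range(samples)]

    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    # Sample second moments of the two factors.
    Ea2 = [[mean(a[s][d][k] ** 2 for s in range(samples)) for k in range(K)]
           for d in range(D)]
    Eb2 = [mean(b[s][k] ** 2 for s in range(samples)) for k in range(K)]

    # Expected double sum of products.
    lhs = sum(mean(a[s][d][k] * b[s][k] for s in range(samples))
              for k in range(K) for d in range(D))
    # Cauchy-Schwarz applied to each expectation.
    middle = sum(math.sqrt(Ea2[d][k]) * math.sqrt(Eb2[k])
                 for k in range(K) for d in range(D))
    # Cauchy-Schwarz applied to the sum over k (inner product vs. norms).
    rhs = sum(math.sqrt(sum(Ea2[d])) * math.sqrt(sum(Eb2)) for d in range(D))

    # Assumed stand-in for 2B: the largest summed second moment.
    two_B = max([sum(Eb2)] + [sum(row) for row in Ea2])
    return lhs, middle, rhs, two_B * D


lhs, middle, rhs, two_BD = appendix_e_check()
print(lhs, middle, rhs, two_BD)
```

Because the Cauchy-Schwarz inequality holds for any probability measure, including the empirical one, the four returned values are non-decreasing from left to right whether or not the samples are independent.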



References 

[1] M. J. Neely. Stochastic Network Optimization with Application to Communication and Queueing Systems. Morgan & Claypool, 2010.

[2] B. Liu, P. Terlecky, A. Bar-Noy, R. Govindan, M. J. Neely, and D. Rawitz. Optimizing information credibility in social swarming applications. IEEE Trans. on Parallel and Distributed Systems, vol. 23, no. 6, pp. 1147-1158, June 2012.

[3] N. Michelusi and M. Zorzi. Optimal random multiaccess in energy harvesting wireless sensor networks. Proc. IEEE International Conference on Communications, to appear.

[4] M. J. Osborne and A. Rubinstein. A Course in Game Theory. MIT Press, Cambridge, MA, 1994.

[5] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, NY, NY, 2009.

[6] R. Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, vol. 1, pp. 67-96, 1974.

[7] R. Aumann. Correlated equilibrium as an expression of bayesian rationality. Econometrica, vol. 55, pp. 1-18, 1987.

[8] L. Georgiadis, M. J. Neely, and L. Tassiulas. Resource allocation and cross-layer control in wireless networks. Foundations and Trends in Networking, vol. 1, no. 1, pp. 1-149, 2006.

[9] M. J. Neely, E. Modiano, and C. Li. Fairness and optimal stochastic control for heterogeneous networks. IEEE/ACM Transactions on Networking, vol. 16, no. 2, pp. 396-409, April 2008.

[10] X. Lin and N. B. Shroff. Joint rate control and scheduling in multihop wireless networks. Proc. of 43rd IEEE Conf. on Decision and Control, Paradise Island, Bahamas, Dec. 2004.

[11] L. Xiao, M. Johansson, and S. P. Boyd. Simultaneous routing and resource allocation via dual decomposition. IEEE Transactions on Communications, vol. 52, no. 7, pp. 1136-1144, July 2004.

[12] S. H. Low and D. E. Lapsley. Optimization flow control, i: Basic algorithm and convergence. IEEE/ACM Transactions on Networking, vol. 7, no. 6, pp. 861-875, Dec. 1999.

[13] M. Rabbat and R. Nowak. Distributed optimization in sensor networks. Proc. IPSN, 2004.

[14] M. J. Neely. Distributed and secure computation of convex programs over a network of connected processors. DCDIS Conf., Guelph, Ontario, July 2005.

[15] C. C. Moallemi and B. Van Roy. Distributed optimization in adaptive networks. Advances in Neural Information Processing Systems, vol. 16, MIT Press, 2004.

[16] L. Jiang and J. Walrand. A distributed csma algorithm for throughput and utility maximization in wireless networks. Proc. Allerton Conf. on Communication, Control, and Computing, Sept. 2008.

[17] S. Rajagopalan and D. Shah. Reversible networks, distributed optimization, and network scheduling: What do they have in common? Proc. Conf. on Information Sciences and Systems (CISS), 2008.

[18] L. Jiang and J. Walrand. Scheduling and Congestion Control for Wireless and Processing Networks. Morgan & Claypool, 2010.

[19] A. Nayyar. Sequential Decision Making in Decentralized Systems. PhD thesis, University of Michigan, 2011.

[20] M. J. Neely, S. T. Rager, and T. F. La Porta. Max weight learning algorithms for scheduling in unknown environments. IEEE Transactions on Automatic Control, vol. 57, no. 5, pp. 1179-1191, May 2012.

[21] D. P. Bertsekas, A. Nedic, and A. E. Ozdaglar. Convex Analysis and Optimization. Boston: Athena Scientific, 2003.

[22] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.

[23] L. Huang and M. J. Neely. Delay reduction via Lagrange multipliers in stochastic network optimization. IEEE Transactions on Automatic Control, vol. 56, no. 4, pp. 842-857, April 2011.

[24] F. Chung and L. Lu. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics, vol. 3, pp. 79-127, 2006.



